Thanks to technology, what almost anybody can do has been multiplied a
thousandfold, and our moral understanding about what we ought to do hasn't
kept pace. ... You can lay minefields, smuggle nuclear weapons in suitcases,
make nerve gas, and drop 'smart bombs' with pinpoint accuracy. Also, you
can arrange to have a hundred dollars a month automatically sent from your
bank account to provide education for ten girls in an Islamic country who
otherwise would not learn to read and write. ... You can use the Internet to
organize citizen monitoring of environmental hazards, or to check the honesty
and performance of government officials - or to spy on your neighbors. Now,
what ought we to do?



— Daniel Dennett, 2006 
In: Breaking the Spell.

Abstract



A significant part of current Internet attacks originates from hosts that are
distributed all over the Internet. However, there is evidence that most of
these hosts are, in fact, concentrated in certain parts of the Internet. This
behavior

it tends to be concentrated in certain areas. In the real world, high crime
areas are usually labeled as "bad neighborhoods".

The goal of this dissertation is to investigate Bad Neighborhoods on the
Internet. The idea behind the Internet Bad Neighborhood concept is that the
probability of a host behaving badly increases if its neighboring hosts (i.e.,
hosts within the same subnetwork) also behave badly. This idea, in turn, can be
exploited to improve current Internet security solutions, since it provides an
indirect approach to predict new sources of attacks (neighboring hosts of
malicious

first systematic and multifaceted study on the concentration of malicious
hosts on

characteristics of the Internet Bad Neighborhoods, whereas in the second
research question we have focused on how Bad Neighborhood blacklists can be
employed to better protect networks against attacks. The approach employed to
answer both questions consists in monitoring and analyzing network data
(traces, blacklists, etc.) obtained from various real world production networks.

One of the most important findings of this dissertation is the verification
that Internet Bad Neighborhoods are a real phenomenon, which can be observed
not only as network prefixes (e.g., /24, in CIDR notation), but also at
different and coarser aggregation levels, such as Internet Service Providers
(ISPs) and countries. For example, we found that 20 ISPs (out of 42,201
observed in our data sets) concentrated almost half of all spamming IP
addresses. In addition, a single ISP was found having 62% of its IP addresses
involved with spam. This suggests that ISP-based Bad Neighborhood security
mechanisms can be

specific and that they might be located in neighborhoods one would not
immediately expect. For example, we found that phishing Bad Neighborhoods are
mostly located in the United States and other developed nations - since these

while spam comes mostly from Southern Asia. This implies that Bad
Neighborhood-based security tools should be application-tailored.

Another finding of this dissertation is that Internet Bad Neighborhoods are

[...] of the individual IP addresses attack a particular target only once,
while up to 90% of the Bad Neighborhoods attacked more than once. Consequently,
this implies that historical data of Bad Neighborhood attacks can potentially
be employed to predict future attacks.

Overall, we have put the Internet Bad Neighborhoods under scrutiny from
the point of view of the network administrator. We expect that the findings
provided in this dissertation can serve as a guide for the design of new
algorithms and solutions to better secure networks.

CHAPTER 1



Introduction 



[...] network data was transmitted to the University of Southern California's
Information Sciences Institute in Los Angeles, 400 miles away. To reach
the destination, however, the data had to travel more than 100,000 miles,
through three different networks: ARPANET, the Packet Radio Network, and
the Atlantic Packet Satellite [10]. On this day, the widely regarded first true
Internet connection was established, setting a major landmark in the history of
the Internet.

From the seminal three networks interconnection, the Internet has evolved

it figures as "a large-scale, highly engineered system" [?] that interconnects
more than 800 million hosts, which are used by more than two billion people
worldwide [?]. The influence of the Internet on society goes way beyond
the number of users and hosts. As explained by the sociologist Manuel Castells,
"core economic, social, political, and cultural activities throughout the planet
are being structured around the Internet" and "exclusion from it (the Internet)
is

currently so important for the functioning of our society that it is actually
considered part of the critical infrastructure of many countries [13]. A myriad
of critical systems, such as banking, traffic, and transportation, heavily rely
upon
Such dependence has made the Internet very attractive for criminal
organizations

and protests can be carried out. One example is the 2007 Estonia Distributed
Denial of Service (DDoS) attacks, in which many websites from Estonian
organizations,

requests and became overloaded, unable to handle legitimate requests [15].



their online banking, access their government online services or even read their
online newspapers [14]. Another example of malicious activity on the Internet
is spam, a misuse of electronic mail. It is estimated that between 84% and 90%
of all e-mail messages are spam nowadays [?], and behind it, cyber gangs
run lucrative operations by selling pharmaceuticals [?], distributing malicious
software (malware), among other illegal activities [?]. As DDoS attacks,

losses from $10 billion to $87 billion yearly [?].

Behind these attacks, we typically find a large number of IP addresses,
usually distributed all over the world. Some of these attacks are even carried
out by

The zombies can be seen as "hijacked" computers, located at homes, schools,
and businesses, controlled by the botmaster to carry out malicious activities.
Figure 1.1 shows the geographical location of a sample of 1,193 computers
belonging to the botnet Hlux2/Kelihos.B [?], which we generated by processing a
trace file we have obtained from SURFnet [?]. As can be seen, the distribution
of bots extends to all populated continents.

Even though the malicious hosts are distributed all over the world, there is
[...] as example Figure 1.2, in which we present the distribution of spamming
hosts
[...] security engineers (analogous to the NYPD) want to reduce the incidence of
attacks on the Internet, they should start by tackling networks where attacks
are more frequently originated. If a user (in analogy to the random person in
the real world example) wants to be safer on the Internet, he/she should avoid
(or at least be much more careful when) connecting to computers located in such
networks.

The lists of Bad Neighborhoods, both in the real world and on the Internet,
are usually compiled into what is popularly known as blacklists, which are a
form of access control mechanism that allows an entity (e.g., users) to access a
particular resource with the exception of those entities listed [52]. On the
Internet, blacklists

spam [?].

In the real world, some businesses have generated bad neighborhood blacklists
with locations in which they would not operate for security reasons. For
example,

London, Manchester, Glasgow, and Birmingham they would not deliver packages
[?]. Microsoft has recently been granted a patent for a Global Positioning
System (GPS)-based navigation system that allows drivers and pedestrians to
avoid routes through neighborhoods having high crime rates [?]. The



[...]le to statistically predict [...]

[...] neighbor IP addresses of the sender (i.e., [...]
[...] have been previously blacklisted and uniform[...]
[...] message. The probability of a message being [...]

[...] With this purpose [...]
[...] Bad Neighborhood concept [...]
[...], the algorithm checks if [...]

[...] employed to filter out spam [?], the very concept was not inve[...]
[...] details. This dissertation, however, focuses on a multifaceted in[...]

[...]istics and how to protect a network again[...]

In the following, we first present our definition of Bad Neighborhoods in
Section 1.1. Then, in Section 1.2 we present the goal, research questions, and
approach employed in this dissertation. After that, we summarize in Section 1.3

1.4. Finally, the outline of the dissertation is detailed in Section 1.5.

1.1 Defining Internet Bad Neighborhoods 



Definition 1. Internet Bad Neighborhood is a set of IP addresses clustered
according to an aggregation criterion in which a number of IP addresses perform
a certain malicious activity over a specified period of time.

used to cluster malicious IP addresses into Bad Neighborhoods. Different
criteria can be employed for this purpose. The main one is the IP addressing

network prefixes (e.g., /24, /8, /18, in Classless Inter-Domain Routing (CIDR)
notation [?]). Alternative criteria can be employed, such as geographical
location (e.g., countries, cities, as in Figure 1.1) or also according to the
network's Autonomous System Number (ASN) [?] of the Internet Service Provider
(ISP).
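
As an illustration of prefix-based aggregation, the following minimal sketch groups IPv4 addresses by network prefix using Python's standard `ipaddress` module. The function name `aggregate_by_prefix` and the sample addresses are assumptions made for this example only; they are not taken from the dissertation's tooling.

```python
import ipaddress
from collections import defaultdict


def aggregate_by_prefix(addresses, prefix_len=24):
    """Group IPv4 addresses by the network prefix of the given length."""
    neighborhoods = defaultdict(set)
    for addr in addresses:
        # strict=False masks the host bits: 10.10.10.3/24 -> 10.10.10.0/24
        network = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        neighborhoods[str(network)].add(addr)
    return dict(neighborhoods)


# Hypothetical malicious sources, for illustration only.
malicious = ["10.10.10.3", "10.10.10.4", "10.10.20.7", "192.0.2.1"]

by_24 = aggregate_by_prefix(malicious, 24)  # three /24 neighborhoods
by_16 = aggregate_by_prefix(malicious, 16)  # coarser: two /16 neighborhoods
```

Changing `prefix_len` changes the aggregation criterion: the same four sources form three neighborhoods at /24 but only two at /16.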

The number of IP addresses, on the other hand, refers to the number of
malicious

emphasize that this number might differ from the total number of IP addresses
in the neighborhood, since some IP addresses within the bad neighborhood
could actually be "good IP addresses". For example, an IP-based /24 Bad
Neighborhood, such as 10.10.10.0/24, has a fixed size of 256 IP addresses.
However,

citizens living in such places.

bad neighborhood is abusing or conducting attacks on (e.g., spam, SSH brute
force attacks, phishing). Therefore, a single host might belong to different Bad
Neighborhoods that differ in relation to the application.

Finally, period of time refers to the time frame used to define a bad
neighborhood [...]
are expected to change over time - since machines are expected to get
compromised



1.2 Goal, Approach, and Research Questions 

The goal of this dissertation is to scrutinize the Bad Neighborhood phenomenon

protect networks from Bad Neighborhood attacks. The general approach employed
consists in monitoring and analyzing network data (traces, blacklists, etc.)
obtained from real world production networks. The idea is to analyze such data
sets and learn how Bad Neighborhoods behave on the Internet, so we can develop
techniques that allow network administrators to better secure networks.

• Research Question 1 (RQ 1): What are the characteristics of Internet
Bad Neighborhoods?

RQ 1 focuses on scrutinizing the Bad Neighborhood phenomenon, by providing
an investigation on why it occurs on the Internet, how they can be found,
After scrutinizing the Internet Bad Neighborhood phenomenon, we then as[...]
work against such bad neighborhoods. To carry this out, we employ blacklisting,
which has been employed as an access control method to filter out spam sources
for many years. Alternatives to that would be whitelisting - lists of IP
addresses that are allowed to use a resource - and greylisting. Whitelisting is
not

with the large number of IP addresses on the Internet. Greylisting (in which a
[...]ther, since it is tailored only to spam - while the bad neighborhood
definition



• Research Question 2 (RQ 2): Which blacklists should a network
administrator choose to protect a network against attacks from Internet Bad
Neighborhoods?

In RQ 2 we focus on providing network administrators with insights on how
to choose bad neighborhood blacklists obtained from different sources.
Moreover, for this RQ, we evaluate how specific bad neighborhood blacklists are
in relation to an application, determining if they can be employed to protect
against attacks on applications for which they were not originally intended.
Finally, we also address the temporal attack strategies employed by bad
neighborhoods in order to determine how often blacklists should be updated and
provide insights on when to



1.3 Contributions 

The contribution of this dissertation is to present, to the best of our
knowledge, the first systematic and multifaceted study on the Bad Neighborhood
phenomenon on the Internet. By first acknowledging and verifying the existence
of Bad Neighborhoods on the Internet, we then scrutinize Internet Bad
Neighborhoods in a multifaceted approach in order to reveal their
characteristics and

attacks originated from Bad Neighborhoods.



1.4 Scope and Limitations

The bad neighborhood concept is aimed at dealing with attacks that employ a

as other security approaches, it does not cover all types of Internet attacks.
For example, highly sophisticated and precisely targeted cyber-weapons, such as
Stuxnet, are likely to be as stealthy as possible, and therefore, likely
not captured by Bad Neighborhood-based security systems (Stuxnet is the first
confirmed cyber-weapon designed by a nation state [32], developed to subvert
industrial systems located at Iranian uranium enrichment facilities).

In addition, in this dissertation we evaluate only IPv4 Bad Neighborhoods.
Currently, IPv6 [?] traffic accounts for less than 1% of the total traffic
observed in networks such as Internet2 [?] and the Amsterdam Internet Exchange
Point (AMS-IX) [?]. Due to that, IPv6 attacks remain relatively rare - only in
2012 the first IPv6 DDoS attacks were reported. With the increasing adoption of
IPv6, we can expect more attacks from IPv6 Bad Neighborhoods. To cope with
that, we present in Appendix C an analysis on what to expect from IPv6 Bad
Neighborhoods. As we show in Appendix C, the Internet Bad Neighborhoods

a vast number of valid IPv6 addresses.



1.5 Dissertation Outline 

Figure 1.3 outlines the structure of this dissertation, divided in four parts,
each of them having a different emphasis on the Internet Bad Neighborhoods
phenomenon

In Part I (Introduction), we present the introduction to this dissertation and

locate bad neighborhoods on the Internet, and we verify the Bad Neighborhoods

In Part II (Characteristics), we address RQ 1 ("What are the characteristics of
Internet Bad Neighborhoods?"), by covering Bad Neighborhood aggregation as
well as their location, and a case study in which we tailor the Bad Neighborhood

In Part III (Defending against Bad Neighborhoods), we investigate RQ 2
("Which blacklists should a network administrator choose to protect a network
against attacks from Internet Bad Neighborhoods?"), by showing how a network
administrator can protect the network he/she maintains by employing Internet
Bad Neighborhood blacklists from different sources and applications. In

Finally, in Part IV (Conclusion), we present the conclusions of this
dissertation

Following this structure, we divide Part I into the following chapters:

• In Chapter 1 - Introduction, we present the introduction to this
dissertation




issues. In addition, we carry out an experiment to verify the Bad
Neighborhoods assumption - proving that it is a worthy idea to predict new
sources of attacks on the Internet. Last, we address the ethical issues
implicated by the Internet Bad Neighborhood concept.

In Part II we provide three chapters that investigate the characteristics of
Internet Bad Neighborhoods:

• In Chapter 3 - Internet Bad Neighborhoods Aggregation, we propose
two approaches to aggregate Internet Bad Neighborhoods into network
prefixes and evaluate them, employing real world data sets.

• In Chapter 4 - Internet Bad Neighborhoods Location, we reveal where
the Internet Bad Neighborhoods are concentrated - in terms of countries,
cities, Autonomous Systems [?], and organizations.

• In Chapter 5 - Case Study: Spamming Bad Neighborhoods, we take
spam Bad Neighborhoods as a case study and refine our general definition
of Internet Bad Neighborhoods.

In Part III we focus on protection against bad neighborhoods, by providing

• In Chapter 6 - Bad Neighborhood Blacklists from other Sources, we

determine what is the best strategy to generate Internet Bad Neighborhood
blacklists: (i) trust others or (ii) carry out local measurements.

• In Chapter 7 - Bad Neighborhoods Blacklists from Different Applications

Bad Neighborhood blacklists obtained from one application in relation to
another application.

• In Chapter 8 - Bad Neighborhoods Temporal Attack Strategies, we

out their malicious activities.

In Part IV we present Chapter 9 - Conclusion, in which we finalize this




CHAPTER 2 



Background 



[...] of Bad Neighborhoods on the Internet in Section 2.1. Next, we proceed by

the issues associated with each step of the approach in Sections 2.3-2.6. Then,
in Section 2.7 we scrutinize the Bad Neighborhood assumption, and evaluate
it experimentally. Finally, in Section 2.8 we discuss the ethical implications
associated with this Bad Neighborhood concept.



2.1 Why Internet Bad Neighborhoods Exist 

possible reasons: 

1. Some Internet Service Providers (ISPs) neglect malicious activities in their
networks.

2. Whenever a host is infected by malware, it is more likely that this malware
is going to succeed in infecting neighboring hosts belonging to the same badly
managed network than hosts in well managed networks.

3. Non-technical local factors may contribute, such as the rate of software
piracy, legislation, culture, economy, and education level in a country.

The first reason for the existence of Bad Neighborhoods on the Internet is
that we can expect different ISPs to have security policies differing in
effectiveness. As discussed by Ramachandran et al. [?], there are some ISPs that




McColo Corp. When McColo was disconnected from the Internet by two of their
upstream providers (Global Crossing and Hurricane Electric) due to the large
amount of malware and botnets in their networks [?], several reports have

In such "malware tolerant" ISPs, one can also expect malware to be
more successful in infecting other neighboring hosts [?] (second reason). These

even more the concentration of malicious hosts and occurrence of BadHoods in
such ISPs.

Finally, non-technical local factors (third reason) may also contribute to the
BadHood phenomenon. One could expect that ISPs are more likely to neglect

their countries (e.g., the United States has a specific anti-spam legislation
[?], as well as the European Union [?]). In addition, one could expect countries

therefore more vulnerable software.

It is also important to mention that there is an economic drive behind these
assumptions. Cyber-gangs continue carrying out malicious activities on the
Internet simply because there is a profitable business model - which is not in
the scope of this dissertation. On this topic, however, McCoy et al. have
analyzed

programs and shown that "online sales of counterfeit or unauthorized products
drive a robust underground advertising industry that includes email spam [...]",

spamming methods - which provides incentive for having more compromised
hosts, most likely to be observed in the networks of poorly managed ISPs in




2.2 Finding Internet Bad Neighborhoods 

In the real world, crime statistics are of importance when deciding if a
neighborhood should be considered "bad" or not. These statistics are generated
by [...] pressed by the victims.

(BadHoods in the rest of this dissertation). The idea is to compile statistics
per

Figure 2.1 summarizes the approach we propose to find Internet BadHoods.

target. After being attacked, the target feeds the attack detection system with
information related to the attack (e.g., trace files) so attacks can be detected.
These trace files are processed and the sources of the attack are identified
based on the source IP address. In addition, other data might be obtained from
the IP packets, such as timestamps, number of bytes, etc. After that, a
blacklist con[...]

and used as an input to the aggregation process, in which sources get
aggregated

as /24, or geographical information). In the end, a final BadHood blacklist is
generated (we use the term throughout this dissertation to refer to a list of
malicious Bad Neighborhoods and to differ from traditional blacklists).
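
The steps above (attack records, a /32 blacklist of sources, then aggregation into a BadHood blacklist) can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout `(source IP, timestamp, bytes)`, the sample values, and the function names `build_blacklist` and `build_badhood_blacklist` are hypothetical, and /24 is just one possible aggregation criterion.

```python
import ipaddress
from collections import Counter, defaultdict

# Hypothetical output of an attack detection system:
# (source IP, timestamp, number of bytes) per detected attack.
records = [
    ("10.10.10.3", "2011-11-11T10:00:00", 512),
    ("10.10.10.3", "2011-11-11T10:05:00", 498),
    ("10.10.10.4", "2011-11-11T11:00:00", 730),
    ("192.0.2.9", "2011-11-12T09:30:00", 120),
]


def build_blacklist(records):
    """/32 blacklist: each source IP mapped to its number of attacks."""
    return Counter(src for src, _timestamp, _size in records)


def build_badhood_blacklist(blacklist, prefix_len=24):
    """Aggregate sources into prefixes: (distinct sources, attacks)."""
    hoods = defaultdict(lambda: [0, 0])
    for ip, attacks in blacklist.items():
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        hoods[str(net)][0] += 1        # one more distinct malicious source
        hoods[str(net)][1] += attacks  # attacks observed from that source
    return {net: tuple(counts) for net, counts in hoods.items()}


blacklist = build_blacklist(records)
badhoods = build_badhood_blacklist(blacklist)
```

Keeping both counters per neighborhood lets a consumer of the list distinguish a prefix with many distinct malicious hosts from one with a single very active host.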

In the next sections we present mote details about each step involved m the 

2.3 Attack Sources and Attribution 

a potential malicious source. Traditionally, desktops/laptops have been the main
source of attacks on the Internet. However, we can expect in the near future
more attacks to be originated from mobile devices (e.g., smart phones, as in the



case of the recently found Android-based botnet [?]) as well as from devices
that, in the past, were not connected to the Internet and currently are (part of

players, refrigerators, SIP phones, just to mention a few.

Identifying the responsible attacker for the attack is referred to in the
literature as attack attribution, that is, "determining the identity or location
of an attacker or an attacker's intermediary" [?]. As defined by Wheeler and
Larsen [?], identity may be the attacker's user name, name, alias, or related
information associated with the person orchestrating the attacks. Location, on
the other hand,

(e.g., IP address).

difficult on the Internet. In this sense, attackers commonly employ intermediary
nodes between themselves and the target system. By employing such hosts,
attackers hide their identity, since IP packets perceived as attacks at the
target appear to be originated from the intermediary hosts.

Figure 2.2 illustrates the attack attribution problem. In this figure, solid
lines represent network links, and circles R1, R2, and R3 are the routers
connecting the attacker to the target. Each router is connected to a local
network (square), to which hosts (orange circles) are connected. To illustrate
the attribution problem, consider that the attacker in Figure 2.2 is a
botmaster controlling a botnet (botnets are currently one of the major security
threats on the Internet - see also Appendix [?] for more on this matter).
Consider also that the target is a

Instead of attacking the target directly, the attacker uses another logical
path (dashed line in Figure 2.2) to hide his original identity. First, the
attacker connects to a stepping stone node (St) - which is a host used to
redirect the connections from the attacker to the Co, the Command and Control
center of the botnet. Multiple St hosts can be used in this process. After
connecting to the Co, the attacker sends the commands to the command and
control (Co), which then sends the orders to a zombie (or a set of zombies, as
Zo), which are the machines that actually carry out the spam campaigns, ending
up at the target. Optionally, zombies can employ reflector hosts (Re), which
work like a proxy between the target and the zombie, hiding the zombie
identity. At the end of the process, the target receives the attack (e.g., a
spam message) having the source IP address of the zombie (Zo) or the reflector
host (Re).

[...] benefit from other network features, such as network address translation
(NAT), which changes the source and destination address fields of the IP packet
header

source address of the attackers is not used in the routing process, it may be
easily forged, which is commonly known as IP spoofing [?]. Other techniques
can also be employed; for a more detailed view on the matter please refer to
The approach presented in this dissertation, however, focuses on the
attribution of the last host in the logical path of the attacks (Zo or Re). In
this sense, Bad Neighborhoods are ultimately vulnerable networks having
compromised machines,

flagged as malicious might not represent the behavior of the host's owners, who
actually might be unaware that their computer is involved in such attacks
(we discuss the ethical implications of this in Section 2.8).

the point of view of a network administrator who wants to protect a network
from malicious sources. For the network administrator, knowing the identity of
the attacker does not help to better protect the network he/she maintains, since
blocking traffic from the attacker IP address to the network the administrator
maintains does not stop spam messages from originating from Zo or Re in Figure
2.2. In contrast, we see the attribution of the responsible attacker as a task
of cyber police forces instead. Such type of research is outside the scope of
this dissertation.



2.4 Targets

the Internet that is victim of attacks carried out by malicious sources.
Traditional examples of targets are servers and desktops/laptops. However,
mobile devices (e.g., smart phones, tablets) are also potential targets, as
devices such as TV

As shown in Figure 2.1, the generation of BadHood blacklists is coupled with
the monitoring of one or more targets. We refer to the resulting BadHood
blacklist as Target's BadHood List (TBL), because it lists the neighborhoods
attacking that particular target. This, however, does not imply that the
particular target has observed all existing BadHoods.

To observe more Bad Neighborhoods, one idea is to monitor a large number

that all existing BadHoods are listed in the resulting BadHood blacklist.

One approach to generate a complete BadHood blacklist would be to monitor
every single target on the Internet and generate a single BadHood blacklist.
This

sheer size and complexity of the Internet. Monitoring the whole Internet and
then coordinating efforts to share the resulting BadHoods imposes challenges
that


2.5 Data Collection and Attack Detection 



In order to locate BadHoods on the Internet, we have to obtain network data
(e.g., traces) and perform the attack detection. Several sources of data can be




Target-centric data sources: this category encompasses monitoring the

network administrator might monitor all the network traffic (in PCAP format
[?]) to an individual server.

Network-centric data sources: this category covers monitoring the net[...]
[...]ple, consider the network router (such as one of the routers in Figure
[?]). In this case,
After obtaining the data, the next step consists in detecting attacks. In the

[...] Detection System (IDS), for example, are classified according to the
technique employed to detect attacks. Signature-based IDS compare network
traffic to pre-determined attack patterns, which are popularly known as
signatures. Snort is an example of a signature-based IDS [?]. The other type of
IDS are anomaly-based IDS, which compare incoming data to a model of normality
that describes the expected or "normal" behavior. Statistical analysis and
Markov models are used for anomaly-based IDS [55] [?].
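
To make the distinction concrete, the toy sketch below contrasts the two detection styles. It is only an illustration under simplifying assumptions: the signature strings, the z-score threshold, and the function names are hypothetical, and real systems such as Snort use far richer rule languages and models.

```python
from statistics import mean, stdev

# Hypothetical attack patterns ("signatures"), for illustration only.
SIGNATURES = ["/etc/passwd", "<script>"]


def signature_match(payload):
    """Signature-based check: flag payloads containing a known pattern."""
    return any(sig in payload for sig in SIGNATURES)


def is_anomalous(history, observed, threshold=3.0):
    """Anomaly-based check: flag values far from the 'normal' history.

    Uses a simple z-score as the model of normality (an assumption of
    this sketch; the text also mentions Markov models).
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > threshold


# Requests per minute seen from a host during normal operation.
normal_rates = [10, 12, 11, 9, 10, 11]
```

A signature check can only catch what has already been described, whereas the anomaly check can flag unseen behavior at the cost of depending on how well `history` represents normal traffic.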

Attacks can also be detected by application servers. For example, the mail

[?]. In addition, honeypots, which are essentially systems that act as traps to
detect malicious activities, can be employed to detect attacks [?]. Finally,

example, OSSEC [?] is a host intrusion detection system (HIDS) that correlates
various log files (e.g., SSH server, Web servers) to detect attacks.

In this dissertation, we do not focus on the detection itself; rather we rely
upon other systems/techniques for this particular purpose. As a consequence, the
quality of the BadHood blacklist depends on the monitored data and techniques
employed to detect attacks. Due to that, errors might occur in identifying
attacks;

ultimately impacts the correctness of the resulting /32 blacklist, which is the



2.6 Aggregating Hosts into Bad Neighborhoods

The main idea behind the Bad Neighborhood concept is the aggregation of
malicious

range, city, country, etc. The advantage of doing so is that it allows one to
predict attacks from unforeseen sources (neighbors) before they occur (see
Section 2.7 for the investigation of this assumption).

In the left part of this figure, consider that A is a target (e.g., a mail
server)

is the number of the host within each circle).

In this figure, target A is attacked by hosts 10.10.10.3 and 10.10.10.4. If

same figure. By employing this blacklist, A could block any new attempts from
both hosts in the future.

However, if A were to generate a blacklist employing the Bad Neighborhood

consider, as an example, that A aggregates the malicious hosts into a /24
prefix. Since hosts (1 - 4) share the same 3 octets of the IP address
(10.10.10), they can be aggregated into a single /24 prefix (10.10.10/24),
which comprises all addresses in the range 10.10.10.0-10.10.10.255. As a
consequence, the only entry in the BadHood blacklist would be 10.10.10.0/24, as
shown on the right in Figure [?]. Associated with this neighborhood would be a
numerical value

a single neighborhood, and judge equally all the 256 hosts in this netblock in
case a new message arrives. This, in turn, allows A to be protected from any
host from the same /24, or the same BadHood - which can be seen as a form
of predicting new attacking sources, and not only reacting to observed /32

both considered malicious, even though they have not yet attacked A - and may
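
The example of target A can be reproduced with a short sketch. The variable and function names are assumptions made for illustration; the addresses are the ones used in the text.

```python
import ipaddress

# Hosts observed attacking target A.
attackers = ["10.10.10.3", "10.10.10.4"]

# Traditional /32 blacklist: exactly the observed sources.
blacklist_32 = set(attackers)

# /24 BadHood blacklist: the observed sources collapsed into their prefix.
badhood_24 = {
    ipaddress.ip_network(f"{ip}/24", strict=False) for ip in attackers
}


def blocked_by_32(ip):
    """True only for hosts that have already attacked."""
    return ip in blacklist_32


def blocked_by_24(ip):
    """True for any host inside a listed /24 neighborhood."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in badhood_24)


network = next(iter(badhood_24))
print(network, network.num_addresses, network[0], network[-1])
```

With the /32 list, the neighbors 10.10.10.1 and 10.10.10.2 are still accepted; with the /24 BadHood list they are blocked before sending anything, which is the predictive effect described above, and also the source of possible false positives.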

Table [?] shows a sample of a /24 BadHood blacklist. It lists five randomly
chosen /24 BadHoods (out of more than 500,000). For this BadHood blacklist,

Netherlands. In the first column, the bad neighborhoods are listed, while in
the second we list the number of distinct malicious sources that were observed
sending spam; the number of spam messages is shown in the third column. For
example, out of the 256 hosts listed in neighborhood 41.254.0.0, 227 have



2.7 Verifying the Bad Neighborhood Assumption

As explained in Chapter 1, previous work has shown that malicious hosts tend

lead to the Bad Neighborhoods assumption - that BadHoods can provide

neighboring hosts of malicious ones are more likely to be malicious as well and,
therefore, more likely to carry out attacks.

In this section, we verify the Bad Neighborhood assumption. To do so, we
carry out a case study to determine if Bad Neighborhoods are useful in
predicting

employed in relation to non-BadHood-based solutions.

With this purpose in mind, we consider a simplistic mail filter, as shown in
Figure [?]. To verify if a message is spam or not, the mail filter looks up the
sender IP address (IP) in the blacklist (Blacklist X, which can be one of the
three blacklists BL #1-BL #3 in the same figure). If the sender's IP address is
found in the blacklist, the message is considered malicious (spam); otherwise it
is considered legitimate (ham).
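
The lookup performed by this simplistic mail filter can be sketched as below. The function name `classify` and the sample entries are assumptions for illustration; an entry may be a single address or a CIDR prefix, so the same filter works with a /32 blacklist or a BadHood blacklist.

```python
import ipaddress


def classify(sender_ip, blacklist):
    """Return 'spam' if the sender matches any blacklist entry, else 'ham'.

    Entries without a prefix length (e.g. '10.10.10.3') are treated as /32.
    """
    addr = ipaddress.ip_address(sender_ip)
    for entry in blacklist:
        if addr in ipaddress.ip_network(entry, strict=False):
            return "spam"
    return "ham"


# Hypothetical blacklists, for illustration only.
list_32 = ["10.10.10.3"]     # traditional /32 blacklist
list_24 = ["10.10.10.0/24"]  # /24 BadHood blacklist

print(classify("10.10.10.9", list_32))  # neighbor not listed individually
print(classify("10.10.10.9", list_24))  # neighbor covered by the prefix
```

The linear scan keeps the sketch short; a production filter handling hundreds of thousands of prefixes would more likely use a hash set of prefixes or a radix tree.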

We verify the Bad Neighborhood assumption by comparing the performance
delivered by the mail filter when using three different blacklists (BL #1,
BL #2,

(CBL) [?]. As input data, we use the IP addresses of spammers observed
by the Electrical Engineering, Mathematics, and Computer Science Faculty of
the University of Twente (UT/EWI), for a period from November 11th to 23rd,
2011. The list of spammers was extracted from the log files of SpamAssassin

In Section 2.7.1 we present more details about the generated blacklists,
while in Section 2.7.2 we show the UT/EWI data set used as input. Finally,
in Section 2.7.3 we present and discuss the results.

2.7.1 Blacklists Evaluated

In the previous section we have stated that we evaluated three blacklists to
verify our assumption. Many criteria can be used to generate blacklists, and
there are many blacklists publicly available on the Internet (a comparison of
blacklists from various sources is covered in Chapter |g).

We have chosen to employ the Composite Blocking List (CBL) (£fl| as the stan-

CBL was selected because (i) it has been previously investigated by academic
and Internet security communities liSilSlHISlESI, and (ii) the CBL provides
bulk access to the blacklist data, to ensure we have a complete view of the
malicious IP addresses. We call this blacklist CBL32-STD. We have downloaded
CBL once a day, and we therefore have one blacklist per day; we used the same
monitoring period for both blacklists and the incoming mail.

From the standard blacklist (CBL32-STD), we create, for each day, a /24
Bad Neighborhood list, as described in Section |23|. We refer to this blacklist
as CBL-BadHood24. By comparing this
blacklist against CBL32-STD, we can observe the improvement on the detection
rate incurred by using the Internet Bad Neighborhood concept.

However, as also discussed in Section ^yfil, by using the BadHood concept, we

only observed a single host belonging to the neighborhood. In practice, it is as if
we have blacklisted all /32 (256 addresses) from each /24 prefix. Therefore,
comparing directly the performance of the CBL-BadHood24 list to the CBL32-STD list
is not a fair approach, since CBL-BadHood24 lists, equivalently, 256 times more
hosts than CBL32-STD (in the worst case scenario, of one malicious address

To create a fair comparison, we create a third blacklist, to which we refer
as CBL32-EQUIV24-RND. The idea behind this blacklist is to create a /32 blacklist
with an equivalent number of hosts as CBL-BadHood24, and, therefore, provide
a fair comparison. However, instead of using the Bad Neighborhood assumption
(that a host is more likely to be malicious if its neighboring hosts are

CBL32-EQUIV24-RND is similar to the one delivered by CBL-BadHood24, then the
To illustrate how the blacklist generation works, consider Table |23|. On
November 10, for example, we have obtained, after aggregating the CBL blacklist
into /24, 815,688 Bad Neighborhoods. This, in turn, is equivalent to 205,842,688
/32 hosts (/32 Equivalent, by multiplying each BadHood by 256, which is the
number of hosts in a /24 BadHood El).

list (one per day) using two steps:

• Use all inputs of CBL32-STD for the same day (to make sure that these /32
hosts are also blocked here)

• Add x new random /32 IP addresses to the list, where x is shown in the
Diff in Table |5|2| for the particular day

In doing so, we create a blacklist that is able, in terms of /32 entries, to
block the same number of hosts as CBL-BadHood24. It is the difference between
the performance of CBL-BadHood24 and CBL32-EQUIV24-RND that tests the Bad
Neighborhoods assumption, by comparing if a BadHood-based blacklist is more
efficient than a randomly generated one, provided that both of them are able
to block the same number of hosts.
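The two steps above can be sketched in Python. This is only an illustration under stated assumptions: the original program is not shown in the text, the function names are hypothetical, and the address filter is deliberately simplified (it skips only 0/8, 127/8, and 224/8 and above rather than the full IANA list of reserved prefixes).

```python
import ipaddress
import random


def random_unicast_ip(rng):
    """Draw one random IPv4 address, skipping a few non-unicast ranges.

    Simplified filter (an assumption of this sketch): 0/8, 127/8 and
    everything from 224/8 upwards (multicast and reserved) are rejected.
    """
    while True:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        first_octet = int(addr) >> 24
        if first_octet in (0, 127) or first_octet >= 224:
            continue
        return str(addr)


def build_equiv_random(cbl32_std, n_badhoods, rng):
    """Build a /32 list with as many entries as n_badhoods /24 blocks cover.

    Step 1: keep every /32 entry of the standard list.
    Step 2: add new random /32 addresses until the list holds
            n_badhoods * 256 entries.
    """
    target = n_badhoods * 256
    result = set(cbl32_std)
    while len(result) < target:
        result.add(random_unicast_ip(rng))
    return result
```

Passing a seeded `random.Random` instance keeps a run reproducible, which is convenient when the experiment is repeated several times and averaged.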

gram. We have only considered valid unicast /8 prefixes in this process:
we have not considered 127/8, multicast addresses, or reserved /8s, as described
by IANA (53). We have employed the method nextInt(int n) available in
the java.util.Random API, which generates "a pseudo-random, uniformly dis-
tributed int value between 0 (inclusive) and the specified value (exclu-
sive)" 153.

Even though we run the risk that some of the addresses generated by our pro-
gram may not have been in use (e.g., not allocated by the RIRs), we expect
that they will represent less than 10% of the total. Currently, there are 13.43
/8-equivalent IP addresses in the RIRs reserved pool. This corresponds to 8.6%
of the 156.71 /8-equivalent addresses allocated. To compensate for this, we
repeat the experiment 10 times and present the average results.



2.7.2 Incoming Mail

In order to evaluate the performance of the mail filter shown in Figure [zsl we
need three blacklists (already described) and incoming mail. We have consid-
ered, for this case, the incoming mail of the Electrical Engineering and Computer
Science Department of the University of Twente (UT/EWI), from November 11
to 23, 2011. The mail has been previously analyzed using SpamAssassin Mi,
and we have obtained the IP addresses of the malicious sources.

have sent at least one spam message. Table 2.3 presents more details about the

spam messages a day, over the monitoring period.



2.7.3 Mail Filter Performance Evaluation

In this section, we compare the blacklists described in Section 2.7.1 to the
UT/EWI data set, as shown in Table 2.3. In this comparison, we impose a
one-day difference between the incoming mail blacklist (UT/EWI) and the black-
lists BL1-3 (e.g., we compare November 11th UT/EWI spam addresses to the
blacklists generated based on November 10th).

Figure |23| shows the results of our evaluation (also shown in Table gg).
We can see that the performance of CBL32-STD (by using the previous day CBL
original /32 blacklist) allows us to block, on average, 54.33% of all the spam
messages observed by UT/EWI, for each day. That means that by employing a
blacklist containing individual hosts observed by CBL, we are able to filter out
roughly half of UT/EWI spam, regardless of the day. These results are to be used

spam detection. That means that by blocking all the individual hosts listed on CBL
and their neighboring hosts, we have been able to block most of UT/EWI spam.

However, to make it a fair comparison, we have to compare these results
to the performance of CBL32-EQUIV24-RND, as discussed in Section 2.7.1. We
have then generated the 10 random CBL32-EQUIV24-RND blacklists in order
to reduce statistical uncertainty. In Figure |3.6t the vertical bars on each
point of CBL32-EQUIV24-RND represent the standard deviation obtained from
running the algorithm 10 times (also shown in Table 2.4). As can be seen,
this blacklist performs far worse than CBL-BadHood24, delivering an average
performance of 55.64% spam detection. That means that even though both
blacklists contain an equivalent number of /32 entries (CBL32-EQUIV24-RND
and CBL-BadHood24), blocking randomly chosen hosts will not significantly im-
prove spam detection (CBL32-STD and CBL32-EQUIV24-RND do yield comparable
results).

The same conclusion can be drawn from the curve RANDOM-32, which is a
/32 blacklist that has the same number of entries as CBL32-STD. The difference,
however, is that they were all randomly generated (instead of being observed
spamming CBL infrastructures). As can be seen, the effective detection rate of
randomly generated IP addresses is very low for a /32 blacklist, that is, almost
0%.

detection (205 million, on average, or 4.7% of the maximum theoretical in the
IPv4 address space). On the other hand, selecting neighboring hosts of malicious
ones to be blacklisted as well significantly improved the results. Therefore, we
can conclude that Bad Neighborhoods are a much better approach to predict new
sources of attack, when compared to random IP addresses - which validates the
Bad Neighborhood assumption. This, in turn, supports the rest of the work pre-

in EZlEfflUSlHIllIlJ, which were covered in Chapter 1.



2.8 Ethics and Internet Bad Neighborhoods 

According to the University of Tennessee's Internet Encyclopedia of Philosophy,
Ethics (or moral philosophy) is a field of philosophy that "involves systematiz-
ing, defending, and recommending concepts of right and wrong behavior" |62|.

actual deployment of any technology, not only to ultimately obtain a more
"Ethical product", but also to provoke a reflection on the impact of the technologies
on individuals, society, and the environment.

As asserted in 1961 by the cybernetics pioneer Norbert Wiener, "individuals
developing interactive technologies have an ethical responsibility to take likely
consequences, positive and negative, of their designs into account" IE El.
The members of the Institute of Electrical and Electronics Engineers (IEEE) also
have a code of Ethics to be followed 122), and the first article states that members
"accept responsibility in making decisions consistent with the safety, health, and
welfare of the public, and to disclose promptly factors that might endanger the
public or the environment".

is not on the Ethical aspects associated with Bad Neighborhoods, we do provide
an introduction to the Ethical issues involved in the research and deployment
of BadHood-based technologies. We recommend that before implementing the
findings obtained in this dissertation, the responsible persons should carry a

To assist us in the Ethical evaluation, we have carried out a series of
interviews and discussions with Dr. Aimee van Wynsberghe, the Ethical Adviser of
the Centre for Telematics and Information Technology (CTIT) of the University

2.8.2 Three Dimensions of the Ethical Issues and BadHoods

In this subsection we address three dimensions of Ethics and Internet Bad
Neighborhoods research. We first explain each process, the ethical issues associated,
and the proposed ethical solution.

Labeling Malicious Hosts 

We have summarized in Figure ET] the approach we employ to find Internet Bad
Neighborhoods. In Section 2.5 we have covered how the data collection and
attack detection is performed. This is carried out to detect individual malicious
hosts (/32) carrying out malicious activities.

Even though the detection of attacks on the Internet is somewhat effective,
the process itself is not error-free. There is currently no technology that
is able to detect 100% of the security incidents. Moreover, false positives pose
a threat to the fairness value discussed in Section 2.8.1: if a system (e.g., a
network intrusion detection system) wrongly classifies a certain network flow as
malicious, the IP address assigned to it (and consequently, the user that belongs
to this computer) will be taken into the blacklist, and, ultimately, into a Bad
Neighborhood.

Not only that, the information used to create the /32 blacklist is the source
IP address of the last "hop" host (Figure ). However, this very IP address
might be forged, and, consequently, many IP addresses that are not carrying out
attacks may end up in the blacklist, which also violates the fairness value.

Finally, as shown in Figure ^, usually the IP address that is perceived as
malicious is the last hop in a chain of many, which hides the original identity of
the attacker. If the user is unaware his/her computer is being exploited to carry
out malicious activities, is he/she responsible? To which degree? And to which
degree should software developers be held responsible for releasing vulnerable

Taking all these issues into account, we acknowledge that the labeling pro-
cess of individual hosts violates values such as fairness. However, blacklist-based
technologies have been used to filter spam since 1997 to provide other values:
security to the users and ISPs and economic value (reducing potential dam-
age caused by attacks). It is supported by the industry community in various
products, such as the SpamAssassin mail filter IE3 and blacklist providers such
as SpamHaus ISES[Z3I, as well as by the research community ESI, to mention
a few. Therefore, our answer to this ethical dimension is that this is the
"best-effort" approach, that is, it provides a compromise of values (analogous to




Labeling Bad Neighborhoods

We have summarized in Figure |22| the aggregation of malicious hosts (/32) in



individual hosts into Bad Neighborhoods, we actually lose the ability of telling
which hosts within the neighborhood are malicious, and actually judge them
"equally" bad.

As discussed in Chapter 1, "in the real world, locations having higher crime
rates than average are sometimes called bad neighborhoods. In such places, it is
statistically more likely that a crime will occur compared to other locations. The
same principle holds for Internet Bad Neighborhoods: it is more likely that ma-
licious activities are originating from such networks than from other networks".

In this particular case, by aggregating individual hosts into BadHoods, we
actually violate some ethical principles, by introducing bias and prejudice to-
wards the other IP addresses in the cluster (which are associated to individual
computers and, ultimately, users behind the desks). Such IP addresses may have

neighboring IP addresses.

As an analogy to the real world, consider the scenario in which a bank has a
list of clients to whom it would deny services due to previous problems. If
the bank would also include in this blacklist all the immediate neighbors of the
ones previously listed, then the bank would have a Bad Neighborhood blacklist

We also acknowledge that the labeling and the aggregation into BadHoods
is not ethically correct. However, this technique has proved to be effective
and able to provide other values: security and economic values. As we have
shown in Section 2.7.3 for IP-based BadHoods, such labeling has proven that

than randomly chosen neighborhoods of the same size.



Part II 



Bad Neighborhoods 
Characteristics 



Internet Bad Neighborhoods Aggregation



THE Bad Neighborhood concept is based on the assumption that malicious
hosts tend to be concentrated in certain networks instead of being evenly
distributed over the entire IP address space (as verified in Section 2.7
and in ETHUgBEniEIl). In Section lOl we have shown how to aggregate
individual hosts into /24 Bad Neighborhoods (in CIDR notation EH). However,

than /24 (e.g., /20, /IS, /12, etc.), and what prefixes suit best to express Bad
Neighborhoods (BadHoods hereafter), given a certain data set.

the real world. Consider Figure 3.1, in which the x axis represents addresses,
whereas the y axis shows how malicious each individual is (values close to 0
mean legitimate hosts - shown as squares - while malicious ones have higher values
for y - shown as circles). If the local Police Department were to release a list of
the most dangerous areas, the areas could be represented by employing a fixed
aggregation level (e.g., only boroughs, shown as dashed rectangles in Figure
3.1, having a fixed size of 4) or variable aggregation levels (e.g., blocks, streets.

This aggregation into BadHoods, however, has to deal with two conflicting

process should minimize the error incurred.

The first requirement - a concise BadHoods list (in the Internet this means

rity software.

Figure 3.1: Aggregation into Bad Neighborhoods



legitimate hosts are mistakenly included in the aggregation process. Consider
Figure 3.1, in the case of the leftmost fixed-size Bad Neighborhood (BadHood):

real world: not all residents of a bad neighborhood are necessarily malicious;
but some are, and therefore, the entire neighborhood may be labeled as bad.

In this chapter, we investigate at what aggregation levels Internet BadHoods
should be expressed, considering the aggregation requirements aforementioned.
Since the aggregation level may depend on the input data, the goal of this chap-

cious hosts into Internet Bad Neighborhoods of various prefixes (/24-/8).

rithms. The first one, fixed-prefix (dashed rectangles in Figure 3.1), aggregates
malicious hosts using the same aggregation prefix, while the variable-size algorithm
aggregates hosts into different aggregation prefixes (ellipses in Figure 3.1). Both
algorithms deal differently with the aggregation requirements aforementioned:
in Figure 3.1 the fixed-size algorithm generates a list of 4 BadHoods (dashed
rectangles), with an aggregation error proportional to 6 legitimate individuals
mistakenly included (small squares inside dashed rectangles). The variable-size
algorithm, on the other hand, yields 5 BadHoods (ellipses), where it wrongly
aggregates only 2 legitimate individuals.



The rest of this chapter is divided as shown in Figure jSlI. In Section 3.1 we

Next, in Section 3.2 we present the fixed prefix aggregation algorithm. In Sec-
tion 3.3 we present the variable prefix aggregation algorithm. After that, in
Section 3.5 we evaluate those algorithms by using real world data sets, which

is done by employing the metrics defined in Section 3.4.

presented in Section 3.6 and conclusions are discussed in Section^



3.1 Aggregation Principles 

In this section, we describe the basic aggregation operation to aggregate /24
BadHoods into larger BadHoods, which is employed by both algorithms pre-
sented in this chapter. We begin by introducing in Section 3.1.1 the BadHood Score
employed in the basic aggregation operation. Then, in Section 3.1.2 we for-

3.1.1 Bad Neighborhood Score 

Given a list of malicious IP addresses, a /n BadHood (in CIDR notation ESI) is
a /n netblock $B^n$ with a score $\mathrm{score}(B^n)$. We define this score as the
number of malicious hosts in the block:

$$\mathrm{score}(B^n) = \#\{\text{malicious hosts in block } B^n\} \qquad (3.1)$$



Since /24 is "the minimum prefix routable on the Internet", we use /24 as
the starting aggregation level for IP prefixes in the rest of this chapter. Table 3.1
provides a short example of a /24 BadHood list.

The score value leads to an intuitive definition of the "evilness" of a net-
block: the higher the score, the higher the probability that a host address from

size of the block. For the following, it is useful to have a normalized measure of
BadHood score, which represents the percentage of hosts within a netblock that
are malicious. Let $B^n$ be a netblock of size /n with score $\mathrm{score}(B^n)$. We define

$$p_n(B^n) = \frac{\mathrm{score}(B^n)}{\mathrm{maxhosts}(B^n)} \qquad (3.2)$$



where $\mathrm{maxhosts}(B^n) = 2^{32-n}$ is the maximum number of IP addresses in
a /n netblock (neglecting the addresses reserved for broadcast and network

in the netblock are malicious.
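As a hedged illustration of the score and the normalized score, the sketch below counts distinct malicious hosts per netblock and divides by the block size. The function names and the sample addresses are assumptions made for this example only; like the text, it ignores network and broadcast addresses.

```python
import ipaddress
from collections import Counter


def badhood_scores(malicious_ips, prefix_len=24):
    """Map each /prefix_len netblock to its number of distinct malicious hosts."""
    scores = Counter()
    for ip in set(malicious_ips):  # duplicates must not inflate the score
        block = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        scores[block] += 1
    return scores


def normalized_score(score, prefix_len):
    """Fraction of the 2**(32 - prefix_len) addresses in the block that are malicious."""
    return score / 2 ** (32 - prefix_len)
```

For a /24 block the denominator is 256, so a block in which 227 hosts were observed sending spam has a normalized score of 227/256.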



3.1.2 Basic Aggregation Operation 

Given two /n BadHoods $B_i^n$ and $B_j^n$, these BadHoods can be aggregated into
the /(n - 1) BadHood $B_i^n \oplus B_j^n$ if $B_i^n$ and $B_j^n$ have a common address prefix
of n - 1 bits. The aggregated BadHood $B_i^n \oplus B_j^n$ spans the IP addresses of $B_i^n$
and $B_j^n$. For example, in Table 3.1, blocks #1 and #3 can be aggregated from
/24 to /23, while blocks #1 and #7 cannot.

3.2 Fixed Prefix Aggregation Algorithm

The fixed prefix aggregation algorithm iteratively aggregates bad neighborhoods
into larger netblocks. In the first iteration, all /24 BadHoods are aggregated
into /23 BadHoods according to the aggregation operation described in Section
3.1.2.
For example, the /24 BadHoods provided in Table 3.1 will be aggregated into
the /23 BadHoods shown in Table gjU. In the next iteration the /23 BadHoods
are aggregated into /22 ones, and so on.
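The basic aggregation operation that every iteration relies on can be sketched with Python's standard `ipaddress` module. The function names are hypothetical; the check follows the definition in the text, namely that two /n blocks may be merged only if they share their first n - 1 prefix bits.

```python
import ipaddress


def can_aggregate(a, b):
    """True if two distinct /n blocks share their first n-1 prefix bits."""
    return (
        a.prefixlen == b.prefixlen
        and a != b
        and a.supernet() == b.supernet()  # same /(n-1) parent
    )


def aggregate(a, b):
    """Return the /(n-1) block spanning both inputs, or raise if not mergeable."""
    if not can_aggregate(a, b):
        raise ValueError(f"{a} and {b} do not share an (n-1)-bit prefix")
    return a.supernet()
```

Under this check, 30.30.34.0/24 and 30.30.35.0/24 merge into 30.30.34.0/23, whereas 30.30.35.0/24 and 30.30.36.0/24 do not, even though they are numerically adjacent.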

Algorithm 1 presents the pseudocode for this algorithm. The algorithm takes
as input the initial list $S_{24}$ of /24 netblocks $B_i^{24}$ with $\mathrm{score}(B_i^{24})$ and the largest
desired aggregation level m. In each iteration (line 1), the algorithm builds the
list $S_{n-1}$ of /(n - 1) BadHoods by merging all pairs of /n BadHoods $B_i^n$, $B_j^n$
(where possible) according to the basic aggregation operation described in Sec-
tion 3.1.2 (lines 3-5).

It is important to note that, in this algorithm, empty netblocks (score =
0) are included if no matching BadHood is found for aggregation (line 3 in
Algorithm 1). In our example, the /24 BadHood 30.30.34.0 in Table 3.1 is
aggregated with the zero-score netblock 30.30.35.0.
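A compact way to express this behavior is sketched below. It is not the pseudocode of Algorithm 1, which is not reproduced here, but a minimal Python rendering of the description: at each level every block moves to its /(n - 1) parent, and a block without a listed sibling is implicitly merged with a zero-score one. The function name and the scores in the usage note are made up for illustration.

```python
import ipaddress
from collections import defaultdict


def fixed_prefix_aggregate(scores24, m):
    """Aggregate {IPv4Network(/24): score} level by level down to prefix length m."""
    current = dict(scores24)
    for _ in range(24, m, -1):  # one pass per level: /24 -> /23 -> ... -> /m
        merged = defaultdict(int)
        for block, score in current.items():
            # Siblings land on the same parent and their scores add up;
            # a lone block is merged with an implicit zero-score sibling.
            merged[block.supernet()] += score
        current = dict(merged)
    return current
```

For instance, with hypothetical scores of 5 for 30.30.34.0/24, 3 for 20.20.20.0/24, and 4 for 20.20.21.0/24, one iteration (m = 23) yields 30.30.34.0/23 with score 5 and 20.20.20.0/23 with score 7.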

The fixed prefix aggregation algorithm effectively reduces the number of
BadHoods in each iteration because it progressively builds larger netblocks re-




backs. First, aggregating two BadHoods with normalized scores a and b will
result in a BadHood with a normalized score of (a + b)/2. The larger the difference
between a and b, the more information about the behavior of the individual /24
BadHoods in the aggregated BadHood is lost. Secondly, enlarging the BadHoods
can have the side effect of including also netblocks that were not initially flagged
as malicious, as already illustrated in our example by the netblock 30.30.35.0.
This effect aggravates with each iteration.



3.3 Variable Prefix Aggregation Algorithm 

Differently from the fixed prefix aggregation algorithm, the variable prefix algo-
rithm does not apply the same degree of aggregation to all BadHoods. Instead,

Algorithm 2 presents the pseudocode for the variable prefix aggregation algo-
rithm. As in the previous algorithm, the algorithm takes as input the initial list
$S_{24}$ of /24 netblocks $B_i^{24}$ with $\mathrm{score}(B_i^{24})$ and the largest desired aggregation
level m. Then, for each aggregation level n (line 2), the algorithm merges all
/n BadHoods $B_i^n$, $B_j^n$ which would form a valid aggregated BadHood according
to the basic aggregation operation (see Section 3.1.2) and that satisfy the merging
condition (line 3). BadHoods that do not fulfill those conditions are not aggre-

iterations. The merging condition is defined as:



$$\mathrm{merge}(B_i^n, B_j^n) = p_{n-1}(B_i^n \oplus B_j^n) \ge \beta \cdot \max[p_n(B_i^n), p_n(B_j^n)] \qquad (3.4)$$



Algorithm 2 Variable prefix aggregation




The condition is such that we allow a merge only if the resulting normalized

the blocks to be merged. The parameter β therefore prevents the aggregation
algorithm from merging dissimilar BadHoods. This value can be tuned accord-
ing to the scenario and application. β ranges between 0.5 and 1: smaller values
make the aggregation less strict, thus allowing more BadHoods to be merged.

Finally, at line 4, the algorithm progressively builds the new BadHood set
by removing BadHoods and replacing them with the merged one. Note that, in
In order to illustrate the algorithm, we apply it to the example given in
Table 3.1. For β = 0.8 we obtain after one iteration the BadHoods shown in
Table Oj. Blocks #1 and #2 are merged, because their normalized scores satisfy
the merging condition (3.4): $p(B_1 \oplus B_2) \ge 0.8 \cdot \max[p_{24}(B_1), p_{24}(B_2)]$, that is, 0.086 > 0.069.

The other blocks, on the other hand, do not match the condition, so they are
not aggregated. After the first iteration, the list contains both /23 and /24
entries. In the next iterations, no further aggregation occurs, and the final
result contains entries using mixed prefixes (/23 and /24).
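The variable prefix procedure can be sketched as follows. This is an illustrative reading of the description, not the pseudocode of Algorithm 2: it assumes that only pairs of listed sibling blocks are candidates for merging, that the condition is "merged normalized score ≥ β × the larger of the two normalized scores", and that blocks failing the condition are carried over unchanged. All names and the example scores are hypothetical.

```python
import ipaddress


def p(score, prefix_len):
    """Normalized score: share of the block's addresses that are malicious."""
    return score / 2 ** (32 - prefix_len)


def variable_prefix_aggregate(scores24, m, beta=0.8):
    """Merge sibling blocks level by level, but only when they are similar enough."""
    current = dict(scores24)
    for n in range(24, m, -1):
        level = sorted(b for b in current if b.prefixlen == n)
        for block in level:
            if block not in current:
                continue  # already merged together with its sibling
            parent = block.supernet()
            sibling = next(s for s in parent.subnets() if s != block)
            if sibling not in current:
                continue  # assumption: no merging with unlisted (empty) blocks
            merged_score = current[block] + current[sibling]
            threshold = beta * max(p(current[block], n), p(current[sibling], n))
            if p(merged_score, n - 1) >= threshold:
                del current[block], current[sibling]
                current[parent] = merged_score
    return current
```

With hypothetical scores of 22 and 20 for two sibling /24 blocks and β = 0.8, the merged /23 has a normalized score of 42/512 ≈ 0.082, which exceeds 0.8 × 22/256 ≈ 0.069, so the pair is merged; a pair with scores 10 and 1 fails the same test and stays at /24.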

Comparing the results of both algorithms (Tables g3| and [Of) for the first
iteration, the output of the variable prefix algorithm has more entries. How-
ever, the blocks aggregated by the variable prefix algorithm have been matched

Finally, we have implemented both algorithms in a Java prototype. We
have observed runtimes of less than 10 seconds even for large input files (1M

3.4 Evaluation Metrics

In this section we introduce the metrics used to evaluate the aggregation al-

have to deal with two conflicting requirements - generate a concise BadHood
blacklist and minimize the aggregation error.

The evaluation metrics are derived from these requirements. The first one
is the reduction achieved by the algorithms in terms of number of entries in the
initial BadHood input list. We measure it as the difference between the number

in the resulting list generated by both fixed prefix and variable prefix algorithms.

consider a (hypothetical) application, such as a Spam filter, that relies on the
aggregated lists. Let $\{X^{24}, Y^{24}, \ldots\}$ be a set of /24 BadHoods with normalized
scores $\{p_{24}(X^{24}), p_{24}(Y^{24}), \ldots\}$. We can interpret $p_{24}(X^{24})$ as the probability
that a particular IP address in block $X^{24}$ is a source of malicious activities.
After we have aggregated the /24 BadHoods to a /n BadHood $B^n$, with n < 24,
only the normalized score $p_n(B^n)$ of the aggregated BadHood is available to
the application. The "evilness" of a particular IP address in $X^{24}$ can now only
be estimated by $p_n(B^n)$ (Equation (3.2)). Consequently, we define the error
$err(X^{24})$ introduced by the aggregation for the BadHood $X^{24}$ as

(underestimate) the evilness of $X^{24}$ after the aggregation. To assess the global
error for an entire blacklist, we sum up the absolute errors for each /24 BadHood

$$Err_{abs} = \sum_{X^{24}} |err(X^{24})| \qquad (3.6)$$

$$Err_{square} = \sum_{X^{24}} err(X^{24})^2 \qquad (3.7)$$



The difference between both errors is that $Err_{square}$ places greater weight

differences between the normalized score after aggregation ($p_n(B^n)$) and before
the aggregation ($p_{24}(X^{24})$).

To illustrate how they are calculated, consider block $B_1$ in Table 3.1. Before
being aggregated into a /23 BadHood, the normalized score $p(B_1)$ was ^
(Table lO). After that, the same netblock gets the mean value ^. The error
$err(B_1)$ = ^ - m = -0.0019. After calculating the individual errors, the

The interpretation of the error values depends on the application, and other
definitions of the global errors are possible. For example, for an intrusion de-
rates. In such a scenario, calculating the global errors separately for positive

of a particular application and suitable for different scenarios, we have chosen
the rather flexible definitions in (3.6) and (3.7).
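The two global errors can be computed along the following lines. This sketch makes two assumptions that the text above does not settle: the per-block error is taken as the normalized score after aggregation minus the one before (so a positive value means the evilness is overestimated), and the sums run over the /24 BadHoods of the input list only. The function name is hypothetical.

```python
import ipaddress


def global_errors(scores24, aggregated):
    """Return (absolute, squared) global error of an aggregated blacklist.

    scores24:   {IPv4Network(/24): score} before aggregation
    aggregated: {IPv4Network(/n): score} after aggregation; every /24 block
                must be covered by exactly one aggregated block.
    """
    abs_err = 0.0
    sq_err = 0.0
    for block24, score in scores24.items():
        parent = next(a for a in aggregated if block24.subnet_of(a))
        p_before = score / block24.num_addresses
        p_after = aggregated[parent] / parent.num_addresses
        err = p_after - p_before  # assumed sign convention: > 0 overestimates
        abs_err += abs(err)
        sq_err += err ** 2
    return abs_err, sq_err
```

Because the squared variant squares each per-block error before summing, a few large deviations dominate it, whereas the absolute variant treats all deviations proportionally.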



3.5 Evaluation 

tion 3.5.1. The performance of both algorithms is compared for the largest of
our datasets, the Composite Blocking List (CBL; see below), in Section 3.5.2.
In Section 3.5.3 we study the impact of the merging parameter β on the per-

results for different datasets in Section 3.5.4.



3.5.1 BadHood Input Blacklists 

We evaluate our aggregation algorithms on the real case of a Spam blacklist. 
The considered data set is the Composite Blocking List (CBL) Ojl - an online
Spam DNS blacklist. CBL maintains four large spamtrap infrastructures from
where the source IP addresses of spammers are harvested. We have obtained
the list for April 28th, 2010. On this day, CBL listed 8,177,138 /32 IP
addresses, which result in an initial blacklist of 960,167 /24 BadHoods. As
described in Section 3.1.1, we start with /24 since this is the minimum prefix
"routable" on the Internet.

ments in Section 3.5.4:

• Passive Spam Block List (PSBL) J^, obtained on April 28th, 2010: the
list consists of more than 2.aM /32 distinct IP addresses;

• Passive Spam Block List (PSBL) CD, obtained on October 24th, 2011:
the list consists of more than 283K /32 distinct IP addresses;

the Netherlands. We have obtained the IP addresses of spammers on April



3.5.2 Performance of the Aggregation Algorithms 

In this section, we present the results of the aggregation algorithms applied to 

of the aggregation. Then, we discuss the impact of the aggregation algorithms
on the global errors.

Blacklist size

Figure 3.3(a) shows the number of entries (in thousands) in the resulting
blacklists as a function of the aggregation level m for the fixed prefix aggregation
algorithm, whereas Figure ^TifaTl shows it for the variable prefix aggregation
algorithm. For the latter, we have chosen a rather moderate merging parameter
of β = 0.8. The influence of the parameter is discussed in Section 3.5.3.

As expected, both algorithms are able to reduce the number of entries of the
initial input blacklist. If compared, however, we can see that their performance
in terms of the number of entries is very dissimilar. The fixed prefix algorithm
progressively aggregates listed BadHoods into larger netblocks, regardless of their
scores and normalized scores. As a result, the number of entries decreases with
The variable prefix algorithm, on the other hand, only aggregates blocks that
meet the merging condition specified in (3.4). As a result, once no more candi-
date blocks satisfy the condition, the number of entries in the blacklist stabilizes.

prefix algorithm reduces the blacklist from the original 960,167 /24 BadHoods
to 162 entries for /8, the variable prefix algorithm stabilizes at around 711k
entries. However, the /8 fixed prefix aggregation level is very aggressive and is
expected to generate large aggregation errors. We show next aggregation error




Figures lUgg and |33(|j] also show the global absolute errors (see (3.6)) for
the two algorithms as a function of the aggregation level. The results for the
global squared errors (see (3.7)) are shown in Figures [OtE^l and [OfEl.

First and foremost, we observe that the fixed prefix aggregation algorithm
results in much larger errors than the variable prefix algorithm. This is an ex-
score. Therefore, many dissimilar blocks (in regard to their scores in (3.1)) are
aggregated, leading to large differences between the normalized scores of the
individual /24 blocks and the normalized score of the aggregated block.

the errors almost linearly increase with the (decreasing) aggregation level, al-
though the achieved reduction of the number of lines is not linear at all. This
is due to the fact that the algorithm also considers empty blocks, i.e., blocks
with a score of 0. Aggregating an empty block with a non-empty block does

with the aggregation level (see Section [33t). In fact, up to around aggregation
level /IS, substantial reductions in the number of lines are achieved by the al-
gorithm. After this point, as we can see from the constant error increase, the
aggregation of two netblocks becomes more expensive in terms of errors. At
aggregation level /8, the absolute and square errors are respectively 2.36 and
2.8 times larger than at the /IS level, as expected. This leads to the conclusion
that the aggregation to larger netblocks (small prefixes) has a huge impact on
the correctness of the final blacklist and, consequently, our analysis indicates that

than /8 (e.g., /20 for this case, depending on the input data).

In contrast, the error curves of the variable prefix aggregation algorithm
mostly mirror the achieved reduction of lines. Both the number of lines and
the global errors significantly change up to around level /IS. After this point,

Figure 3.5: Variable Prefix Aggregation for β = 0.8



enor Isee mjie^^!^^ ^ 



Distribution of malicious hosts 

As already stated, the variable prefix aggregation algorithm achieves most of the
reduction when moving from level /24 to /23. After that, only a small portion
of the BadHoods fulfill the merging condition and can be aggregated further.
The bar chart in Figure g3J (left y axis) shows the resulting distribution of the
BadHood sizes for aggregation level m = 8 (i.e., /8) and β = 0.8. As can be
seen, a large portion (around 58%) of the initial 960,167 /24 BadHoods are
not aggregated at all and stay at level /24. Around ia9lt entries are aggregated
into /23 BadHoods and only a few entries are aggregated into /22 or higher.

host addresses (right y axis). Reme mber, diat the original data set contains 
8,177,138 host addresses (see Section [3331 . According to the figure, around 2 
mfllion host addresses slay in /24 BadHoods after aggregadon, but most of [he 
bad hosts can now he found in /23 through /17 BadHoods. 
tribution of the BadHood sizes, even when considering that a /(n-1) BadHood
is twice as large as a /n one. For example, we have observed an average of 3.38
malicious hosts per /24 BadHood in Figure gjSj (dividing the right y axis value
by the left y axis value). Intuitively, one could expect the average for /23 to be
twice the average for /24, which is 6.76. However, the average is, in fact, 11.29, a
value that is 67% above what was expected. This can be explained by the nature of

subnetwork, it is natural to expect that similar netblocks can be found in its
own neighborhood. Such netblocks are, then, preferred by the merging condi-

benefits of the aggregation.
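
The comparison of averages can be retraced with a few lines of arithmetic. The Python sketch below only restates the two figures quoted in the text (3.38 and 11.29); the variable names are illustrative and not part of the original analysis.

```python
# Average number of malicious hosts per BadHood, as quoted in the text.
avg_hosts_24 = 3.38       # observed average per /24 BadHood
observed_avg_23 = 11.29   # observed average per /23 BadHood

# A /23 covers twice the address space of a /24, so the naive expectation
# is simply twice the /24 average.
expected_avg_23 = 2 * avg_hosts_24

# Relative excess of the observed /23 average over that expectation.
excess = observed_avg_23 / expected_avg_23 - 1

print(f"expected /23 average: {expected_avg_23:.2f}")
print(f"observed excess: {excess:.0%}")
```

The naive expectation comes out at 6.76 hosts per /23 BadHood, and the observed value lies roughly two thirds above it.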



3.5.3 The Impact of β on the Aggregation



In the following experiments we study the impact of the merging parameter β
on the performance of the variable prefix aggregation algorithm. Intuitively,
if β is too permissive (β close to 0.5), it might result in a blacklist in which
most of the blocks are aggregated, while a more strict value for β close to

aggregation algorithm would also result in a larger aggregation error, while an
algorithm aggregating only very similar blocks would result in a small error.

Figure |3^ shows, for varying values of β, the number of entries in the black-
list output by the aggregation algorithm, as well as the corresponding global
errors (absolute and squared). For β = 0.5, the resulting BadHood list contains
around 470k entries. For increasing values of β, the blacklist becomes
progressively larger, while the errors decrease almost linearly. Finally, for
β = 1, the final blacklist has almost the same number of lines as the original
non-aggregated one, since the algorithm only aggregates valid netblocks with
exactly the same normalized score. Therefore, no error is observed for β = 1.

efficient blacklist and having a small merging error. Therefore an appropriate
value of β should be chosen case by case and, according to the scenario, the
security manager should decide to favor a fast blacklisting process, or a precise


3.5.4 The Impact of Different Blacklists on the Aggregation 



rithm for the other data sets presented in Section pXlf We ordc ^etesult^of

In Figure we show the number of lines of the result blacklists relative
to the original sizes of the /24 data sets, as computed by the variable prefix
aggregation for varying aggregation level and β = 0.8. We observe that our
aggregation algorithm is able to reduce the blacklist size for each of the con-
sidered data sets. For β = 0.8, the data sources experience a reduction in the
number of entries from 10% for the "Provider A" data set to 26% for the CBL

A second observation is that the two largest lists (CBL and PSBL) from April
2eth, clearly benefit more from the aggregation than the smaller lists. This is
expected because the BadHoods in the smaller lists are more sparsely distributed
over the Internet address space and, hence, are harder to aggregate. In addition,
the "Provider A" data set experiences the smallest reduction of all four traces.



3.6 Related Work 



Back in 1993, only classful addresses were used (former classes A, B, and C).

routers beyond the ability of current software, hardware, and people to effec-
tively manage" |^ .

Therefore, in 1993 the IETF introduced the Classless Inter-domain Rout-
ing (CIDR) addressing ESI, and the prefix notation used here. This new ad-
dressing scheme allowed blocks to be allocated under prefixes different than
the ones specified by classes A, B, and C. That allowed route entries with the
same prefix to be aggregated in what is called supernets. By aggregating

and that decreased the requirements for storing routing information on routers
and the overhead when matching routes. Current BGP routers have typically
372k entries in their routing tables MH , a small value compared to current /24
BadHoods blacklists, such as CBL (1M+ entries).
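
Supernetting can be tried out with Python's standard `ipaddress` module. The sketch below is an illustration only; the two /24 routes are arbitrary private-range examples, not entries from any routing table discussed here.

```python
import ipaddress

# Two adjacent /24 route entries (arbitrary example prefixes).
routes = [
    ipaddress.ip_network("10.1.0.0/24"),
    ipaddress.ip_network("10.1.1.0/24"),
]

# collapse_addresses merges adjacent or overlapping networks where possible.
supernets = list(ipaddress.collapse_addresses(routes))

# The same block is obtained by asking one entry for its parent prefix.
parent = routes[0].supernet()

print(supernets, parent)
```

Both routes are covered by a single /23 entry, which is the kind of table-size reduction that CIDR aggregation provides.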



3.7 Conclusions 

into IP prefixes expressed in CIDR notation. We have proposed two algorithms

The first aggregation algorithm - fixed prefix - has proven to be very efficient
when it comes to reducing the number of lines in BadHood blacklists. By aggre-
gating the entries to /18, we observe a reduction of B3aS% on the /24 original
size. However, the error incurred by this algorithm is high. For the evaluated
data, the results have shown that the aggregation further than /18 corresponds
to a reduction of very few entries at the expense of a large error. The implica-
tion of a large error is that a large number of hosts end up being labeled as bad
(by taking part in the aggregated BadHood) even though no malicious activities
have been observed from such hosts.
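
The trade-off of fixed prefix aggregation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `aggregate_fixed` is hypothetical, the three /24 blocks are documentation prefixes rather than real blacklist entries, and the "extra addresses" count is a simple proxy for over-blocking, not the error metric used in this chapter.

```python
import ipaddress

def aggregate_fixed(blocks_24, new_prefix):
    """Map every /24 BadHood to its covering block of length new_prefix."""
    return {
        ipaddress.ip_network(block).supernet(new_prefix=new_prefix)
        for block in blocks_24
    }

# Hypothetical /24 BadHoods (documentation address ranges).
bad_24 = ["192.0.2.0/24", "198.51.100.0/24", "198.51.101.0/24"]

aggregated = aggregate_fixed(bad_24, 18)

# Addresses covered after aggregation versus addresses originally listed.
covered = sum(net.num_addresses for net in aggregated)
listed = sum(ipaddress.ip_network(b).num_addresses for b in bad_24)
extra = covered - listed

print(len(bad_24), "->", len(aggregated), "entries;", extra, "extra addresses")
```

Three entries shrink to two, but the two /18 blocks cover tens of thousands of addresses that were never listed, which illustrates why the error grows quickly for short prefixes.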

The second aggregation algorithm - variable prefix - has been designed with
the goal to aggregate BadHoods if they are sufficiently similar - defined by the

that it is able to reduce the aggregation error. However, the final blacklists it
Internet Bad Neighborhoods Location



existence of Internet Bad Neighborhoods is due to three possible reasons:
(i) that some Internet Service Providers (ISPs) neglect malicious activities in
their networks, (ii) that malware is more likely to spread on the networks of
such ISPs, and (iii) that non-technical local factors may play a role, such as

The goal of this chapter is to address these assumptions. We investigate
them by assessing the total number and the ratio of malicious hosts found in

The motivation to carry out this research is to evaluate our assumptions and,
ultimately, to provide network administrators with concise information to better
protect their networks. E.g., this can be used to filter traffic not only based on IP
addresses, but also on the ISP and/or their geographical origin.

ISPs?

• Research Question 4.2 (RQ 4.2): How are malicious hosts distributed over
geographical areas (countries and cities)?

RQ 4.1 focuses on evaluating the first two assumptions for the existence of Inter-
net Bad Neighborhoods - that some ISPs neglect malicious traffic and malware
propagation in their networks. These assumptions should hold in case we find
ISPs having a significant concentration of malicious hosts.
RQ 4.2, on the other hand, covers the third assumption for Internet Bad
Neighborhoods (BadHoods): that non-technical local factors play a role. A con-
centrated distribution of malicious hosts in a limited number of countries would
support our assumption.

for two applications (spam and phishing) for a period of one week, from July
19th to 25th, 2012. After that, for each data set, we have extracted the /32 IP
addresses of the malicious sources. To answer RQ 4.1, we have aggregated (and
ranked) malicious hosts into Autonomous Systems (AS) 07) and organizations,
while to answer RQ 4.2 we have aggregated (and ranked) the IP addresses into


The rest of this chapter is divided following the structure presented in Figure
FT]: first, we review in Section 4.1 how IP addresses and ASes are allocated
on the Internet. Then, in Section 4.2 we show how to map individual /32 IP
addresses into ISPs and geographical location, while in Section 4.3 we cover the
datasets employed. After that, RQ 4.1 is covered in Section 4.4 while RQ 4.2 is
addressed in Section 4.5. Section 4.6 discusses the related work and Section 4.7
presents the conclusions.
4.1 IP Addresses and ASes Allocation

In order to understand how it is possible to map individual IP addresses into

and ASes are allocated on the Internet.

The best practices for IP address allocation are covered in RFC 2050 1^,
which we briefly review in this section. The entity responsible globally for this
task is the Internet Assigned Numbers Authority (IANA) Ea), which is a depart-
ment of the non-profit private organization Internet Corporation for Assigned
Names and Numbers (ICANN) E3J, located in the United States.

IANA performs its tasks by delegation: it allocates entire /8 prefixes (e.g.,
120.0.0.0/8 in CIDR notation (SI)) to the Regional Internet Registries (RIRs).
The IPv4 prefixes assigned by IANA to the RIRs can be found at (53 , while the
assigned IPv6 global unicast prefixes can be found at ^ . Currently, there are
five RIRs, divided according to geographical regions:

• African Network Information Centre (AfriNIC) ESI : RIR for Africa.

• American Registry for Internet Numbers (ARIN) MS- RIR for the United

• Asia-Pacific Network Information Centre (APNIC) USS): RIR for Asia, Aus-
tralia, New Zealand, and neighboring countries.

• Latin America and Caribbean Network Information Centre (LACNIC) EEl :
RIR for Latin America and parts of the Caribbean region.

• Réseaux IP Européens Network Coordination Centre (RIPE NCC) EH : RIR
for Europe, Russia, the Middle East, and Central Asia.

Figure shows a "snapshot" of the IPv4 allocation map in 2006. In this
figure, each of the 255 numbered blocks represents one /8 netblock (CIDR no-
tation), arranged according to the Hilbert curve ED, while green blocks were

After obtaining these prefixes, each RIR re-allocates IP address ranges to its
customers, typically ISPs and other organizations. The allocation information
is kept in a public database, which is made available through the whois ^3
software. Listing |T] shows a partial output of the whois command, issued to
query ARIN for the IP address "208.80.152.201". As we can observe, the prefix
208.80.152.0/22 is allocated to WIKIMEDIA (NetName).







can observe that the IP address 208.80.152.201 belongs to the network prefix
208.80.152.0/22 (CIDR field) and its Origin AS is AS14907, which is the
Autonomous System Number (ASN) of the Wikimedia Foundation.
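
The fields discussed above can be extracted from whois-style text with a small parser. The Python sketch below is an assumption-laden illustration: `SAMPLE` is a hand-written excerpt that mimics the "Key: value" layout and reuses only the values quoted in the text, it is not verbatim whois output, and `parse_whois` is a hypothetical helper name.

```python
import ipaddress

# Hand-written excerpt in "Key: value" form (illustrative, not real output).
SAMPLE = """\
CIDR:           208.80.152.0/22
NetName:        WIKIMEDIA
OriginAS:       AS14907
"""

def parse_whois(text):
    """Collect 'Key: value' pairs, skipping lines without a value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields

record = parse_whois(SAMPLE)

# Check that the queried address really falls inside the reported prefix.
inside = ipaddress.ip_address("208.80.152.201") in ipaddress.ip_network(record["CIDR"])

print(record["NetName"], record["OriginAS"], inside)
```

Real whois output differs between registries, so a production parser would need per-registry handling of field names.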



4.2 Mapping Principles 

In this section we show the mapping principles of /32 IP addresses to ISPs in
Section 4.2.1 and into geographical information in Section 4.2.2.

4.2.1 Mapping IP addresses and ISPs (ASN)

In order to map malicious IP addresses into ISPs, we employ the ASN associated
to the IP address in question. By definition, an ISP is always a transit AS - that

ASN is a unique number and by using ASN to identify ISPs, one can easily filter
traffic in network firewalls/IDS.



However, as defined in Section 5 of RFC 1930 113, not every AS is an ISP.
In fact, any organization could potentially apply for an ASN, but it is not neces-
sary, since organizations can employ the ISP's ASN even if the organization has
been allocated with multiple IP prefixes by the RIR. To illustrate this, consider
Figure 23]. In this figure, the ISP ISP, with its own ASN, provides connectivity
to the Internet to three organizations (Org1, Org2, Org3). Org1 has been al-
located with its own ASN and network prefixes by the RIR, while Org2 has been
allocated only with network prefixes. Org3, however, has not been allocated
with prefixes or ASN by the RIR.

In this scenario, a malicious IP address x from Org1 is seen on the Internet
as part of Org1 (NetName field in Listing |g) - that is, it is listed on the RIR's public
database as allocated to Org1. In addition, x is also seen as part of the AS of
Org1, since Org1 has its own ASN (OriginAS in Listing |T]). In this case, the ASN
associated to x does not represent the ISP, but Org1. An example of IP address
that fits into this category is 1, which is allocated to the Canadian

law firm Stikeman Elliott, which has its own ASN (AS 25805) and employs
Cogent Communications (AS 174) as its ISP.

with a prefix that contains y by the RIR. However, y is seen on the Internet
as part of the ISP's ASN, since Org2 does not have its own ASN and employs the
ISP's one. In this case, the ISP would "be blamed" as the source of a malicious IP
that has been, in fact, allocated to another organization by the RIR. This is the
case for the IP 12.m.l9.i, which is allocated to the coffeehouse chain Starbucks.
Starbucks employs AS17226, which belongs to AT&T, a major ISP in the United States.

been allocated IP addresses by the RIR. In this case, Org3 employs IP addresses
from its own ISP, and also the ISP ASN. A small business with an ADSL connection
is a good example of this case. It is important to notice that these organizations
are not registered at the local RIR, and therefore are globally seen as part of the
ISP.
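
The three cases can be condensed into a small decision rule. The following Python sketch is an illustration under stated assumptions: the `Org` record, the `attribution` function, and the AS numbers (taken from the range reserved for documentation) are all hypothetical and only mirror the Org1, Org2, and Org3 scenario described above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Org:
    name: str
    own_asn: Optional[int]   # ASN allocated to the organization, if any
    own_prefix: bool         # True if the RIR allocated prefixes to it

def attribution(org, isp_name, isp_asn):
    """Return (registered owner, origin ASN) under which the org's IPs are seen."""
    owner = org.name if org.own_prefix else isp_name
    asn = org.own_asn if org.own_asn is not None else isp_asn
    return owner, asn

ISP_NAME, ISP_ASN = "ISP", 64500   # documentation-range ASN, hypothetical

org1 = Org("Org1", own_asn=64501, own_prefix=True)    # own ASN and prefixes
org2 = Org("Org2", own_asn=None, own_prefix=True)     # prefixes only
org3 = Org("Org3", own_asn=None, own_prefix=False)    # neither

seen = [attribution(o, ISP_NAME, ISP_ASN) for o in (org1, org2, org3)]
print(seen)
```

Only in the first case do both views point at the organization itself; in the second the ASN points at the ISP, and in the third both the registry entry and the ASN do.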

To cope with the fact that not all ASNs are associated to ISPs, we aggregate
malicious IP addresses into both AS-based BadHoods and Organization-based
Several sources can be used to map individual IP addresses into ASN. For
example, Team Cymru builds a database of IP to ASN mapping based on
more than 50 BGP feeds, updated every four hours gjj. Other providers,
like MaxMind, provide as well a publicly available ASN to IP database, named
GeoLite Autonomous System Number Database (53. Due to simplicity of use,
we have employed the MaxMind database to resolve IP addresses into ASNs.
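
Conceptually, resolving an IP address to an ASN is a longest-prefix match against a table of announced prefixes. The Python sketch below illustrates that idea only; it does not use the MaxMind or Team Cymru interfaces, and the table `PREFIX_TO_ASN`, its documentation prefixes, and the AS numbers are hypothetical.

```python
import ipaddress

# Hypothetical prefix-to-ASN table (documentation prefixes and ASNs).
PREFIX_TO_ASN = {
    ipaddress.ip_network("198.51.100.0/24"): 64500,
    ipaddress.ip_network("198.51.100.128/25"): 64501,
}

def lookup_asn(ip, table=PREFIX_TO_ASN):
    """Return the ASN of the most specific prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    # The longest prefix (largest prefix length) is the most specific route.
    best = max(matches, key=lambda net: net.prefixlen)
    return table[best]

print(lookup_asn("198.51.100.5"), lookup_asn("198.51.100.200"), lookup_asn("203.0.113.1"))
```

A linear scan is adequate for a toy table; databases with hundreds of thousands of prefixes would normally use a trie or a sorted range index instead.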



Mapping IP addresses into Organizations

Since organizations can have their own ASN, it is necessary to also aggre-

2050 gl|, organizations are allowed to ask the RIR for a block of IP addresses.
If the organization fulfills the RFC 2050 requirements for having IP addresses

information publicly available via whois. In Listing |l] we can observe that the
prefix 208.80.152.0/22 (in the CIDR entry) was assigned to Wikimedia Foun-
dation (OrgName entry).

Each organization can therefore choose any provider available to connect it-
self (and the assigned IP addresses) to the Internet. By aggregating IP addresses
into organization-based BadHoods, however, we are able to spot which organi-
zations are owners of the most active malicious IP addresses on the Internet,
regardless of the ISP they choose to connect to the Internet. To resolve an IP ad-
dress to the organization, one could use the RIR publicly available whois data-
bases. However, the whois output format varies - that is, some organizations
provide more information than others, and the field names may vary. Therefore,
in this chapter, we employ the standardized database built by MaxMind, which
is based on the RIR whois databases, for ease of use.

ASes: are they legally responsible for malicious traffic?

up using their ISP's AS (e.g., such as Org2 and Org3 in Figure d), one could
wonder if an AS should or should not be held legally responsible for malicious
traffic observed in the network of the clients (e.g., IPs x and y from Org1 and
Org2).

This question is very controversial, as discussed by G. Huston. It deals
with the question if an ISP should or should not filter content in their network,

tion. Therefore, we recommend that the answer should be sought in a multi-


case of AS26780, back in 2008. AS26780, belonging to the San Jose based
web hosting service provider McColo Corp., was disconnected from the Internet
by two of their upstream providers (Global Crossing and Hurricane Electric),
due to the large amount of malware and botnets in their networks 05). The

shown that the volume of worldwide spam was reduced by 67% gi) . McColo
was owned by the Russian national Oleg Nikolaenko (named "King of Spam" by
the FBI), who is currently being held in prison in the United States.

Regardless of the legal question of whether an ISP should be held responsible
for the transit traffic on their networks, in this dissertation we aggregate /32
addresses into AS-based BadHoods in order to identify which ISPs malicious IPs



4.2.2 Mapping IP addresses into Geographical Information

In this section we present the mapping principles of /32 IP addresses to geograph-
ical location. We first focus on country-based BadHoods, and then extend the
concept to city-based BadHoods.
deliver customized ads according to the user's location), fraud detection (e.g.,
online stores can check the physical location of a client against its billing ad-
dress), media licensing (e.g., broadcasters, such as those on Hulu (M, only

atering Mil- 

Even though the organization's city and country are provided by the RIR
database via whois (as shown in Listing |l]), that does not mean that these are the
geographical locations of the hosts having these IP addresses: this is actually the
physical address of the organization, which may have its computers located
in a data center hundreds of miles away from the physical address.

As described by Poese et al. IggiligZ] , there are currently two main paradigms

driven geolocation. Active IP techniques are typically based on network delay
measurements - but they do lack scalability and present a high measurement

engine (e.g., SQL/MySQL) containing records for a range of IP addresses, which
are called blocks or prefixes" ESIESl.

bases is usually not made explicit (therefore, its precision is questionable), Poese
et al. have investigated the reliability of the databases' geolocation informa-
tion. They have carried out an experiment comparing the results obtained from
the database with the actual results of a large European provider. They found
that the databases perform very well when geolocating IP addresses to country-
level (95% to 98% success rate, depending on the database), while for city-level,
the results were far less precise: for the MaxMind database 033, 60% of the
locations presented a 100km location error. However, the experiment carried by

country (most likely Germany, due to the authors' affiliation and the 800km maxi-
mum distance limit in the country), which might not reflect the overall accuracy
of the database. In addition, for the case of MaxMind, the company provides a
web page with the precision of their results [104]. For example, for Germany,
the company claims that it is able to resolve 78% of the IP addresses within a
range of 40km, while 18% are wrongly resolved and 4% belong to unknown

Even though there are limitations regarding city-level geolocation, it is still
widely used by many companies. In this chapter, we have chosen to employ the
MaxMind database. As shown by Poese et al. ^ [Ml , MaxMind HTgSlI is one



4.3 Evaluated Datasets 

In order to answer the research questions raised in this chapter, we have ob-
tained representative real world data sets (blacklists). Since we want to evalu-
ate if the results hold for different malicious activities, we have obtained black-
lists for spam and phishing. In addition, we have chosen these sources because
they have been employed by both research and Internet security communities.
The chosen data sets are:

• Spam: for spam blacklists, we have employed the Composite Block List

DNS blacklist (DNSBL) of the highest possible quality and reliability, that
own spam trap infrastructure, and the IP source address of any message

• Phishing: we have obtained data from Phishtank, which is an open com-

websites" llOfil . It provides a blacklist of URLs that contain forged web-
sites. Since we need IP addresses instead of URLs to proceed with our
analysis, we have obtained this blacklist and resolved all the URLs to IP
addresses using Google Public DNS M3-

After choosing the data sets, we have obtained data for the same monitoring
period: from July 19th to 25th, 2012. We have then generated a final blacklist
containing all /32 unique IP addresses observed in the monitoring period, for
both CBL and Phishtank data sets. In the end, we have obtained 9,320,157
unique /32 IP addresses of spam sources, and 3,016 unique /32 IP sources of
phishing sites.

Both spam and phishing /32 blacklists were then aggregated into AS-based
BadHoods, organization-based BadHoods, country-based BadHoods, and city-
based BadHoods, using the approach described in Section 4.2 with the help of
a small program developed in Java.
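
The aggregation step amounts to de-duplicating the /32 addresses and counting them per key. The sketch below illustrates this in Python; it is not the Java program mentioned above, and the `IP_TO_ASN` mapping, the sample addresses, and the function name `build_badhoods` are hypothetical.

```python
from collections import Counter

# Hypothetical IP-to-AS mapping; a real run would query an IP-to-ASN database.
IP_TO_ASN = {"192.0.2.1": 64500, "192.0.2.2": 64500, "198.51.100.7": 64501}

def build_badhoods(blacklist, key_of):
    """Count unique /32 addresses per aggregation key (AS, organization, ...)."""
    unique_ips = set(blacklist)   # repeated sightings count only once
    return Counter(key_of(ip) for ip in unique_ips)

# The same address may be observed several times during the monitoring period.
observed = ["192.0.2.1", "192.0.2.2", "192.0.2.1", "198.51.100.7"]

as_badhoods = build_badhoods(observed, IP_TO_ASN.get)
ranking = as_badhoods.most_common()   # largest BadHood first

print(ranking)
```

Swapping `key_of` for a function that returns an organization, country, or city yields the other three kinds of BadHoods with the same code.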



4.4.1 AS-based Internet BadHoods 



AS-based Spam BadHoods

Table gr2| shows the Top 20 ASes ranked according to the total number of spam-
ming IP addresses (absolute numbers). In this table, Sources refers to the num-
ber of malicious IP addresses observed, while IPv4 Orig. refers to the num-
ber of /32 IP addresses the autonomous system announces (including its own
prefixes plus the prefixes of its customers that do not have their own ASN). We
have obtained this information from the Hurricane Electric BGP toolkit web
site ESSl, which generates it based on the BGP tables. We could have also
obtained the same information from BGP routing tables from other sources; we

By employing the number of IPv4 addresses originated per AS, we are able to
calculate the ratio of compromised IPs within the AS in question. This is a very
important metric, since it shows the percentage of compromised IP addresses

(Sources/IPv4 Originated).
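
As a minimal illustration of this ratio, the Python sketch below divides the number of observed sources by the number of originated IPv4 addresses. The function name `spam_ratio` and the example figures are hypothetical and do not correspond to any AS in the tables.

```python
def spam_ratio(sources, ipv4_originated):
    """Percentage of an AS's originated IPv4 addresses observed spamming."""
    if ipv4_originated <= 0:
        raise ValueError("ipv4_originated must be positive")
    return 100.0 * sources / ipv4_originated

# Hypothetical AS: 1,500 spamming sources out of 6,000 originated addresses.
example = spam_ratio(1_500, 6_000)
print(f"{example:.2f}%")
```

Because the denominator differs by orders of magnitude between ASes, the ratio and the absolute count can rank the same AS very differently.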

its country of origin (Table g;).

As can be seen in the table, the first AS in terms of spamming IP addresses
is AS9829, which belongs to BSNL (Bharat Sanchar Nigam Limited). BSNL
is a state-owned telecommunications company - including telephony (mobile
and landline) and broadband Internet, being the largest in India. The second
AS in terms of spamming IP addresses is AS45595, which belongs to Pakistan
Telecom Company Limited. As BSNL in India, Pakistan Telecom is the major

originated from the AS in relation to all the malicious IP addresses for the same
country (column Country's %). This information can be used by government
authorities to tackle, within their own country, the ASes having most of the spam-
ming IP addresses. For example, 5 out of the 20 listed ASes in Table |4^ are



4.4 ISP-based Internet BadHoods




in their networks (as discussed in Section l^g), this provides a clear indication
of the "health" status of their networks.

Finally, in the World's %, we can observe the percentage of malicious IPs

AS9829, from India, was responsible for 7.39% of all the observed malicious
IP addresses. This is a very large number for a single ISP, considering that
there were 42,201 active ASes at the moment of this analysis. Figure |4?| shows
the percentage of spamming IP addresses for all ASes observed (x axis refers
to the AS number (ASN)). As can be noticed by the peaks, few ASes have the
highest ratio of spamming IP addresses in relation to the total observed. These
results suggest that ASes provide a very good aggregation criterion for identify-

BadHoods that some ISPs neglect/turn a blind eye to malicious activities in their




addresses in these ASes (Ratio column). We have observed 15,078 ASes having
spamming IP addresses, having, on average, 0.58% of their IP addresses sending





Taking this into account, when analyzing Table g^ we can observe the

IP addresses observed in our data sets, from a total of 42,201 active ASes in

BadHoods at the ISP level.

In addition, among the top 20, the AS that has the largest ratio is SaudiNet
(27.65% of its IPv4 addresses). Once again, this finding supports our assump-
tion about ISPs that neglect malicious traffic. ChinaNet, on the other hand, is
ranked 5th in absolute number of malicious hosts. However, this AS announces
more than 110 million IP addresses - and the spamming hosts within this AS
represent only 0.23% of the total - which is below the average ratio observed

though the ratio for ChinaNet is smaller, one cannot neglect the potential dam-
age it can incur due to its absolute number of malicious hosts in comparison to
SaudiNet.

Regardless of the ratio exhibited by the top 20 ASes, it is important to em-
phasize that these ASes have an alarmingly large number of spamming hosts in
their networks, and are truly "spam havens", from which spammers can operate
almost freely.

Since ASes can have different "sizes" (that is, the number of IP addresses
that they originate), we present in Table |43| the top 20 ASes ranked accord-
ing to the ratio of spamming IP addresses. AS25019, belonging to SaudiNet,
from Saudi Arabia, is the only AS that appears in this table and in Table [43|
(the particular case of Internet censorship in Saudi Arabia will be discussed in
Section 4.5.1). In this table, the worst AS is AS37340, in which 62.55% of its
IPv4 addresses were found spamming. However, this AS can be seen as a small
one, since it only originates 5,632 IPv4 addresses. In fact, in this table, 17 of the
20 ASes can be seen as small ones (IPv4 Originated < 25,000). In such providers, a

spamming (< 7,169 per AS).

What we can conclude is that the Bad Neighborhood phenomenon cer-
tainly exists at the AS-level; in only 10% of the 15,078 ASes have a concen-



AS-based Phishing BadHoods

Table g2| presents the top 20 ASes that were found hosting phishing IP ad-
dresses. The number of phishing IP addresses ranges from 22 to 140. All the 20
Since the number of phishing IPs is small in comparison to the number of
IPv4-equivalent IP addresses announced by an AS, we do not proceed with the
calculation of the ratio of phishing IPs in relation to the total observed.



4.4.2 Organizations-based Internet BadHoods 

As discussed in Section [iSll ISPs may connect to the Internet IP addresses
belonging to other client organizations (Org1 and Org2 in Figure ). This,
consequently, implies that the ISP might be, in fact, routing malicious traffic

In order to clarify the responsible organization for originating the malicious

Differently from AS-based BadHoods, the organization-based BadHood is done
by aggregating /32 IP addresses according to the organization they belong to -
that is, the organization that has been allocated with that particular IP address.

Table Ks] presents the Top 20 Spamming Organizations in terms of spam-
ming IP addresses. Comparing this table with Table ^ we can observe that
13 out of the 20 AS owners are found as the most Spamming Organizations.
In addition, 4 organizations are subsidiaries of the AS owners: VDC is a sub-
sidiary of the AS-owner VNPT, Reliance Communications has acquired the AS
owner BSES Telecom, Telemar Norte Leste S.A. is part of the AS-owner Tele-
comunicacoes da Bahia S.A., ChinaNet Guangdong Province Network is routed
using ChinaNet ASN. The new organizations in the list were FPT Telecom, from

nications in Spain and Germany, respectively.

Table gTsI shows the results for the organization-based Phishing BadHoods.
We found that, among 15 out of 20 organizations, 14 are the same as the AS owners
shown in Table g^. One organization (eToxIc) is connected via SoftLayer, the
#1 AS in terms of phishing IP addresses. New organizations were the follow-

which have their own ISP and ASN.

IPs. The reason for that is that organizations having Internet-related activities
as core business are more likely to have more IP addresses allocated, which
increases the chances of having more malicious IP addresses. To illustrate this,



(which produces Budweiser beer) and the Dutch Telecommunications Company
KPN. Even though these companies have a comparable number of employees
(~ 30,000) and revenue (~ €13 Billion), KPN's AS286 announces 3.4 million
IP addresses, while Anheuser-Busch's AS1S117 announces only 69.6 thousand
IP addresses (a factor of 50).

Therefore, these findings also support the findings of previous sections, in
which we have shown that the first two assumptions behind BadHoods are valid.



4.5 Geographical Internet BadHoods 



4.5.2) having the highest concentrations of malicious IP addresses.
4.5.1 Country-based Internet Bad Neighborhoods

Country-based Spamming BadHoods

country in Figure |4!|]. In this figure, the color assigned to each country cor-
responds to the number of spamming IP addresses found in each country, as
shown in the legend. Analyzing the figure, we can observe that:



tries were responsible for 76.31% of all the spamming IP addresses - which




• The BRIC countries (Brazil, Russia, India, and China) are among the coun-
tries with most malicious hosts. These countries currently experience a sig-
nificant economic growth, and, in comparison to the advanced economies

net access (the Internet penetration ratios are: Brazil - 40.6%, Russia -
43.0%, India - 7.5%, China - 34.3%, World Average - 35% 11X^ 1). Ac-
cording to a Boston Group report llllll , the Internet penetration should
increase between 9% to 15% per year until 2015 in the BRIC countries.

we can expect the number of malicious hosts in these countries to increase

scenario, if India would have the same Internet penetration rate of a com-
parable large country - the United States (79%) - while keeping the same
ratio of malicious hosts, it would have, alone, almost 20 million spam-
IP addresses we have observed in our datasets for the whole world.
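
The hypothetical scenario for India is a simple proportional scaling. The Python sketch below shows the calculation; the penetration rates (7.5% and 79%) are the ones quoted above, while the function name `projected_hosts` and the observed host count of one million are placeholders chosen for illustration, not the value measured in the data set.

```python
def projected_hosts(observed_hosts, penetration_now, penetration_target):
    """Scale a host count linearly with Internet penetration.

    Assumes the share of malicious hosts among Internet users stays constant.
    """
    return observed_hosts * penetration_target / penetration_now

# Placeholder count of observed malicious hosts (not the measured value).
observed = 1_000_000

projection = projected_hosts(observed, penetration_now=7.5, penetration_target=79.0)
print(round(projection))
```

With these rates, every observed host corresponds to roughly ten and a half projected hosts, which is why a moderately large present-day count already leads to a projection in the tens of millions.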

Table gl7| presents the top 20 countries having most spamming hosts. In this
table, CC stands for Country Code (please refer to Appendix |D| for the list
of country codes). Pop refers to the country population, in millions (obtained
from the United Nations website [112]), while SRCs refers to the number of
malicious hosts observed per country. Finally, Ratio is the number of malicious
lute and Proportional to the Population)



Analyzing the left part of Table g7| (Absolute), we can observe the countries
that can potentially incur more "damage", measured by the absolute number
of hosts (SRCs). In this table, India is the number one country, followed by
Vietnam and Brazil. Out of the 20 countries, 17 are classified as developing
countries, whereas Germany, Spain, and the United States are the only devel-
oped nations (or advanced economies, in the CIA terminology IITlgl) in the list
of top 20 countries in absolute number of malicious spamming hosts.

On the right side of Table g5] (Proportional), we present the top 20 coun-
tries having most spamming hosts per million inhabitants, which provides an
overview of the networks of these countries. Country codes in bold font ap-
pear in both absolute and proportional lists. We can observe that India, the
country having the highest absolute number of malicious IP addresses, does not
even appear in the top 20 in proportional terms. In fact, none of the BRIC
countries are among the top 20 most proportionally spamming countries. Fur-
thermore, all the countries in the proportional list are classified as developing
nations.
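
The difference between the two rankings can be reproduced with a toy example. In the Python sketch below, the country codes, host counts, and populations are invented for illustration (they are not rows of the table), and `per_million` is a hypothetical helper name.

```python
# Invented rows: (country code, malicious hosts, population in millions).
rows = [
    ("AA", 900_000, 1_200.0),
    ("BB", 50_000, 5.0),
    ("CC", 200_000, 80.0),
]

def per_million(srcs, pop_millions):
    """Malicious hosts per million inhabitants."""
    return srcs / pop_millions

# Ranking by absolute number of hosts versus ranking by hosts per million.
absolute = sorted(rows, key=lambda r: r[1], reverse=True)
proportional = sorted(rows, key=lambda r: per_million(r[1], r[2]), reverse=True)

print([r[0] for r in absolute])
print([r[0] for r in proportional])
```

The most populous country leads the absolute ranking but drops to the bottom of the proportional one, mirroring the behavior described for India.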



Internet Censorship Countries 



their access to the Internet. These countries are:

• Saudi Arabia (SA): According to the OpenNet Initiative (a joint project by
Harvard and Oxford universities - among others - that has as mission "to
identify and document Internet filtering and surveillance" Iml l), Saudi
Arabia uses a proxy farm in King Abdulaziz City for Science & Technology
to "filter sites related to opposition political groups, human rights issues,
and religious content deemed offensive to Muslims. Pornographic sites
are pervasively filtered, as well as circumvention and online privacy tools.
Bloggers have been arrested, and blogs and sites run by online activists
have been blocked." [115]. Saudi Arabia is listed in the "Enemies of the
Internet" list by Reporters sans Frontières (RSF), which is a France-based

pre^s^Bia- ° 

• China (CN): China is responsible for maintaining the "Great Firewall of

cated regimes of Internet filtering and information control in the world"
llll'/l . The government performs pervasive filtering on political and con-

ternet tools. China is classified as an "Enemy of the Internet" nation by
RSF.

• Belarus (BY): Also part of the "Enemies of the Internet" list of RSF, Belarus
net tools areas, according to tests carried out by OpenNet ITial .

• Kazakhstan (KZ): Kazakhstan is listed in the "Countries Under Surveil-
lance" list of RSF. According to OpenNet, Kazakhstan performs selective
filtering on political and social content itffg) .

• Vietnam (VN): Vietnam is classified as an "Enemy of the Internet" by
RSF. It performs pervasive filtering on political content, and selective fil-
tering in social content according to OpenNet [120].

• Tunisia (TN): Tunisia is classified as a "Country Under Surveillance" by
RSF, but it was considered an "Enemy of the Internet" before the Arab
[...]
regime, there was pervasive filtering on political and social content, according
to OpenNet [121]. After the fall of Ben Ali, the new government lifted
the ban on social network sites - Facebook and YouTube included.

surveillance are among the top 20 countries having more malicious hosts is
that, while trying to circumvent censorship, users might end up getting their
computers [...] tools. By becoming infected, such computers may take part in
botnets, which are the source of most of the current spam on the Internet
[?] [?]. In addition, one could assume that censorship in ISPs of these
countries is regarded as more important than the number of spammers in their
networks.

Country-based Phishing BadHoods 

[...] illustrated. Analyzing this figure, we can observe that:



tof 250, as in Table |4l). 



in advanced economy natio 
a phishing hosts. In addidt 
ong the [op 20 phishing ecu 



ts in a 



The reason for this difference between Spam and Phishing distribution over
countries lies in the nature of the application/attack: spammers typically
employ a large number of bots to carry out spam campaigns, while phishing has
[...] redirected to it so they can steal their personal information.

Table 4.8 presents the top 20 countries in terms of number of phishing IPs.
On the left side of the table, the countries are ranked according to the absolute
number of phishing IPs, while on the right side, they are ranked according to
the number of phishing IPs per million inhabitants (Ratio = #IPs / (Pop × 10^-6)).
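
The two rankings described above can be illustrated with a minimal sketch. The function names and the sample figures are hypothetical and are not taken from the data sets used in this chapter; the ratio is computed as count × 10^6 / population, which is algebraically the same as dividing by (Pop × 10^-6).

```python
from typing import Dict, List, Tuple


def per_million(ip_count: int, population: int) -> float:
    """Malicious IPs per million inhabitants: count / (population * 10^-6)."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiplying first keeps the arithmetic exact for round numbers.
    return ip_count * 1_000_000 / population


def rank_both_ways(
    counts: Dict[str, int], populations: Dict[str, int], top: int = 20
) -> Tuple[List[str], List[str]]:
    """Return (absolute ranking, per-million ranking), each cut to `top`."""
    absolute = sorted(counts, key=lambda c: counts[c], reverse=True)
    proportional = sorted(
        counts,
        key=lambda c: per_million(counts[c], populations[c]),
        reverse=True,
    )
    return absolute[:top], proportional[:top]


# Hypothetical figures, NOT taken from the data sets of this chapter.
counts = {"AA": 1000, "BB": 50}
populations = {"AA": 100_000_000, "BB": 500_000}
absolute, proportional = rank_both_ways(counts, populations)
```

With these made-up numbers, the large country leads the absolute list while the small one leads the proportional list, which is the kind of reordering discussed in the text.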

When analyzing the right side of Table 4.8, we can notice that 13 of the
countries listed have less than 20 phishing hosts and are ranked there mostly
because of their small population. The other 7 countries have at least 56 hosts



4.5.2 City-based Internet Bad Neighborhoods 



set from honeypots obtained from Quarantainenet [125], a Dutch
countries having more malicious hosts.



City-based Spam BadHoods 



We start by showing the distribution of the top 400 cities in terms of spamming
hosts, as can be seen in Figure 4.7. In this figure, the size of each circle is
proportional to the number of malicious hosts. As shown in Table 4.3, 1.6
million /32 IP addresses (out of 9.3 million) could not be mapped into cities for

Analyzing Figure 4.7, we can observe that the largest number of the top 400
cities is located in India (88, to be more precise), followed by Brazil with 46,
then 34 in Russia, and 19 in China. Table 4.9 shows the number of spamming
hosts for the top 20 cities. As can be seen, among the top 20, 6 cities are
located in India, 2 in Saudi Arabia, 2 in Pakistan, and 2 in Brazil.

Comparing the results of this table to Table [?] (country-level BadHoods),
[...]
of spamming hosts are located in the countries that have the largest number of
spamming hosts as well.

Table 4.10 shows the cities that have most spamming hosts per million
inhabitants. To obtain the population per city, we have used MaxMind's World
atlas database [?], due to ease of use. Except for cities #1, #5, and #10,
the cities presented a high ratio due to the small population (< 31,000) and
small number of sources (< 5,766). Cities 1, 5, and 10 are located in India




City-based Phishing BadHoods

The results for cities having more phishing hosts are shown in Figure 4.8. In
this figure, we show the geographical location of the top 400 cities having
phishing hosts, out of a total of 437 (as shown in Table 4.3). Table 4.11 lists
the top 20 cities having more phishing hosts in absolute numbers.

As can be seen, there is a concentration in American cities; the top four
cities in terms of phishing hosts are Dallas (TX), Chicago (IL), Provo (UT), and
Houston (TX), all in the United States (in red). These cities all have data
centers within their borders.

Figure 4.8: Top 400 Phishing City-Based BadHoods

Table 4.12 presents the cities having most phishing hosts per million
inhabitants. Cities in bold can also be found in Table 4.11. As can be seen, the
majority of cities are still in the United States. One surprise was city number
1, Road Town, located in the British Virgin Islands, a small island territory in
the Caribbean area.



4.6 Related Work 

The work presented in this chapter was inspired by a previous work conducted in
our research group [124]. In that work, we carried out a study on the most
"evil" cities on the Internet - that is, cities that originated most of the
observed attacks. For that work, we employed information obtained from
Quarantainenet [125], a Dutch company that develops network management and
[...]
customers, including more than 50% of the Dutch universities, Quarantainenet



Table 4.10: Top 20 Spamming Cities (Proportional)



has a honeypot infrastructure which is distributed mostly over the Netherlands.
In total, 125 machines are used for this purpose. In this chapter, however, we
use different data sets for a different monitoring period. In addition, we have
[...]
cording to their AS and organization.

carried out by Shin et al. [?]. In that paper, the authors analyze infection data
for three botnets (Conficker, MegaD, and Srizbi), and carry out a comparative
analysis of each of them. They have shown how the botnets are distributed
over the IP address space and also the countries in which most of the bots are
located. In our work, we determine the number of malicious hosts per country

Most of the current research works focus on geographical location at the
country level. For example, Jiang et al. [128] propose a spam filtering technique
that uses country-level geographical information, which leads to a reduction of
13.9% in their experiments. Even though they were able to reduce the number






of spam messages, the authors do not describe what could happen if city-level
information were used instead of country-level information for filtering spam.

cation data for spam detection. It is stated in the patent that "the geolocation 
data may be any type of geographical information such as city, country, state or
presence within a pre-selected radius of a geographical point". As a patent, the
method is only described, while its effectiveness is not addressed.

Other reports on the number of attacks per country also exist. For example,
the Internet hosting company Akamai provides a quarterly report named "The
State of the Internet" [129], which is obtained from the analysis of users that
access Akamai servers (many sites, such as Hulu, BBC iPlayer and the White
House use the Akamai content distribution network). In their latest report, they
have observed attacks from 209 countries/regions, with the U.S. being the
premier one in terms of traffic (12%). However, only 10 countries are mentioned
in the report, and they do not provide an analysis at city level. Quarantainenet
also provides a daily map of the countries that have attacked their honeypot
infrastructure [130].
Finally, Koike et al. [?] perform data visualization on the origin of attacks
at IP block level or country level. And Muir et al. present a survey on the



4.7 Conclusions 

behind the Internet Bad Neighborhood concept. We have employed real world 
data sets for spam and phishing in our analysis. 

To investigate the assumptions, we proposed two research questions. In RQ
4.1, we have asked how malicious hosts are distributed over ISPs. To answer that,
we have aggregated individual addresses into both ASes and organizations and
compared the results. We found that the top 20 ASes concentrate almost 50%
of all spamming IP addresses observed in our data sets, from a total of 42,201

To make sure the malicious IPs announced by the ASes were not from other
[...] a strong correlation between the AS-based BadHoods and organization-based
[...]
be used by network engineers to develop tools to filter traffic based on the ASes
[...]
behind the BadHood concept - that some ISPs neglect malicious traffic and
malware propagation in their networks.

In RQ 4.2 we wondered how malicious hosts are distributed over geographical
areas (countries and cities). We found that the results depend on the application
[...]
the world (concentrated in Asia), while phishing BadHoods are concentrated in
few countries and cities (mostly in developed nations). This shows that one
cannot generalize assumptions for BadHoods, and the application in question
should be considered (we provide a detailed comparison of BadHoods and the
different applications in Chapter [?]). The reason for that is that phishing
requires a reliable, available web server on which forged websites can be hosted;
therefore, malicious users choose to host them in data centers - which are
concentrated in very few countries/cities. In addition, for spam, we found that
the top 20 countries were responsible for concentrating more than 75% of all
spamming IP addresses, which shows how evident the existence of BadHoods is at

In addition, we found that the BRIC countries are among the countries with
the highest number of spamming IP addresses. Given their current economic
[...]
in these countries if measures are not taken to improve the security in such
networks. For example, if India had the same Internet penetration rate
as the United States (79%), it would have, alone, 20 million spamming IPs -
twice as many as what is observed today for the entire world. One might
wonder if this is not the case of a silent ticking bomb. Moreover, we have found
that 6 countries among the top 20 in terms of spamming hosts employ Internet
censorship measures.

Therefore, the results from RQ 4.2 confirm our last assumption about the
existence of Internet BadHoods: that non-technical local factors play a role in
the BadHoods: phishing IP addresses are located in advanced economies, while
CHAPTER 5



Case Study: Spamming Bad
Neighborhoods



AFTER presenting general characteristics of Bad Neighborhoods in previ[...]
[...]ming Bad Neighborhoods.
We focus on Spamming Bad Neighborhoods (BadHoods) due to the impact
caused by worldwide spam. Spam comprises approximately 90 to 95 percent of
all e-mail traffic on the Internet nowadays [132] [?]. To deal with all this
unsolicited e-mail, companies have to spend computer and network resources, and
human labor hours, which cause economic losses. It is estimated that
worldwide spam causes losses from $10 billion to $87 billion yearly, and



• RQ 5.3: Do Spamming BadHoods with many spammers also send many
spam messages?
• RQ 5.4: How much data do we need to identify Spamming BadHoods?



served on the Internet: Low-Volume Spammers (LVS) and High-Volume Spammers
(HVS) [122]. The first type describes hosts "working under a central provision,
each typically spamming with a low volume", while the second one consists
of "dedicated spam sources, which are brute force spammers, each spamming
in an enormous number every day". Since most of the LVS tend to be part of
botnets [122][123], concentrations of LVS reveal what are the "most infected
[...] Therefore, this first definition addresses RQ 5.1 by identifying LVS
BadHoods, i.e.,

To answer RQ 5.2, we introduce a second definition: HVS BadHoods. This
type of BadHood allows us to identify providers that ignore or tolerate
dedicated spammers that spam at high volume in their networks and, therefore,
[...]
next definition - Spamming BadHoods Firepower - which considers the total
[...]
spam messages and spammers per BadHood, we can determine if there is any
correlation between those two. Finally, RQ 5.4 leads to our last definition -
All Spamming BadHoods -, in which we compare the BadHoods obtained from
different data sources.

The rest of this chapter is structured as follows. Section 5.1 details the four
definitions of Spamming BadHoods. Next, Section 5.2 presents the datasets
[...]
results. Related work is discussed in Section 5.4. Finally, Section 5.5 contains



5.1 Four definitions for Spamming Bad Neighbor- 
hoods 



into the behavior of spammers and the networks hosting them. We begin with
a brief discussion of the possible data sources to evaluate Spamming BadHoods
in Section 5.1.1. Then we present the four definitions of Spamming BadHoods
from Sections 5.1.2 to 5.1.5.

5.1.1 Possible Data Sources for BadHood Analysis

In order to identify and analyze Spamming BadHoods, we need to obtain the 
IP addresses of the spamming hosts. Several data sources can be employed for 



DNS Blacklists 

Their source IP addresses can be used to build blacklists, usually in real time.
The term "DNS Blacklist" comes from the fact that many blacklist maintainers
allow queries to be made to their blacklists in a similar way DNS queries are
performed [...]
do not necessarily list the full IP address of every single spammer. In fact, some
lists only provide aggregated information on whole subnetworks. Even though
DNS blacklists list many IPs, they do not provide the information on how many
spam messages a spammer has sent - they only tell that a certain IP has sent
spam.
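
To make the query mechanism concrete, the sketch below follows the common DNSBL convention (described in RFC 5782) of reversing the octets of an IPv4 address and appending the blacklist's DNS zone. The zone `dnsbl.example.org` is a placeholder, not a real blacklist, and the helper names are hypothetical; `is_listed` needs network access and is shown only to indicate how a lookup would be issued.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the name to look up: octets reversed, then the blacklist zone."""
    # IPv4Address validates the input and raises ValueError if it is malformed.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """True if the zone returns an address record for the reversed name.

    Requires network access; a failed lookup is treated as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

Note that such a lookup only answers whether an address is listed; as stated above, it carries no information on how many messages that address has sent.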



[...] messages are processed and filtered. Mail filters, such as SpamAssassin [?],
are configured to perform a series of checks on every e-mail message. These tests
[...]
from DNS blacklists. Depending on the outcome of these tests, each message is
classified as "spam" or "ham". Differently from DNS blacklists, it is possible to


Mail Client Logs

Spam mails can also be detected by the mail client itself. This is usually the last
[...]
caused by unsolicited mails. Similar to the mail filters used by mail servers, the
mail client, such as Thunderbird [?], can perform a series of tests in order to
classify the mails.

Network Flows

According to the IETF, a network flow is defined as a "set of IP packets passing
an observation point in the network during a certain time interval that share
the same properties" [53]. Typically, these properties include the source and
destination port and address of the packets, as well as other IP header fields.
[...]
export so-called flow records, which contain summarized information on the
identified flows, such as the number of exchanged packets.
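
As an illustration of how packets are summarized into flow records, the following minimal sketch groups packets by the usual 5-tuple (source and destination address, source and destination port, protocol) and keeps a packet and byte count per flow. The `Packet` type and the sample values are hypothetical, and the time interval of the definition is ignored for brevity, so this is a simplification rather than an implementation of any particular flow exporter.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    src: str
    dst: str
    sport: int
    dport: int
    proto: int   # e.g. 6 for TCP
    size: int    # bytes


FlowKey = Tuple[str, str, int, int, int]


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, Dict[str, int]]:
    """Summarize packets that share the same 5-tuple into one flow record."""
    flows: Dict[FlowKey, Dict[str, int]] = defaultdict(
        lambda: {"packets": 0, "bytes": 0}
    )
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p.size
    return dict(flows)


# Hypothetical packets: two belong to one SMTP connection, one to another.
sample = [
    Packet("192.0.2.1", "198.51.100.2", 40000, 25, 6, 100),
    Packet("192.0.2.1", "198.51.100.2", 40000, 25, 6, 60),
    Packet("192.0.2.1", "198.51.100.2", 40001, 25, 6, 40),
]
records = aggregate_flows(sample)
```

A flow record of this kind contains no payload, which is consistent with the remark that validating flow-based spam detection is harder than validating log-based detection.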

Flow monitoring can be used to detect spam [132] [?] [?]. However,
[...]
validation of the detection results is much harder and has to be based on
statis[...]

5.1.2 First Definition: LVS BadHoods

Low-Volume Spammers (LVS) are hosts that spam at a low volume to avoid
being blacklisted. Typically, they are operated under a central provision, usually
as part of a botnet [122]. The latter obviously requires that the host has
first been infected. Hence, a concentration of a high number of LVS in a
subnetwork indicates that the particular subnetwork is poorly protected or
managed, [...]
networks. The goal of this definition for Spamming BadHoods is to detect the
worst protected (or infected) subnetworks by identifying the netblocks with
many LVS. We refer to such BadHoods as LVS BadHoods.

The first step is to classify spammers according to the number of messages
each of them has sent during the observation period. Note that this information
is not provided by blacklists, so we have to rely on the other data sources. Since
spammers can behave differently across different domains, we combine the data
[...]
threshold θ that we apply to the number of sent messages in order to separate
LVS from other spammers. We define

    \theta = d \times n \times m        (5.1)

where d is the number of days in the observation period, n is the number of
different domains being monitored, and m is the maximum number of messages
that a spammer can send to a single domain per day in order to be considered
an LVS. As described in [?], an LVS usually contacts the same mail server
once or twice a day. Hosts spamming under this threshold are classified as LVS.
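
A minimal sketch of this classification step is given below. The threshold is the product of the observation period in days, the number of monitored domains, and the per-domain daily maximum; the values 7, 4 and 2 are the ones used later in Section 5.3.1, where the threshold equals 56 and hosts with at most 56 messages count as LVS. The function and variable names are illustrative, and the per-host message counts are made up.

```python
from typing import Dict


def lvs_threshold(days: int, domains: int, max_per_domain_per_day: int) -> int:
    """Threshold of Equation 5.1: days x domains x daily maximum per domain."""
    return days * domains * max_per_domain_per_day


def classify(message_counts: Dict[str, int], threshold: int) -> Dict[str, str]:
    """Label each spamming host as LVS (at most `threshold` messages) or HVS."""
    return {
        ip: ("LVS" if sent <= threshold else "HVS")
        for ip, sent in message_counts.items()
    }


# Seven days, four domains, at most two messages per domain per day.
theta = lvs_threshold(days=7, domains=4, max_per_domain_per_day=2)

# Hypothetical per-host totals over the observation period.
labels = classify({"192.0.2.1": 1, "192.0.2.2": 56, "192.0.2.3": 57}, theta)
```

The boundary case matters: a host with exactly 56 messages is still LVS, while one with 57 is treated as HVS.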
5.1.3 Second Definition: HVS BadHoods

Complementary to LVS, High-Volume Spammers (HVS) are those hosts that
[...]
period than specified by the threshold θ in Equation 5.1. As for LVS, blacklists
cannot be used to identify HVS.

HVS are usually dedicated spamming hosts operated by professional spammers.
Therefore, a high concentration of HVS in a particular subnetwork indicates
that the ISP tolerates them. To identify HVS BadHoods, we follow the
[...]
spam messages each spammer has sent during the observation period. Spammers
above the threshold θ are considered HVS. After classifying each host, we
count the number of HVS per /24 netblock and rank them according to that
number. As in the first definition, the maximum score for a netblock is 254.
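
The counting and ranking step can be sketched as follows, using Python's standard `ipaddress` module to map each host to its /24 netblock. The helper names and the sample addresses (taken from documentation ranges) are hypothetical; the sketch assumes IPv4 addresses and per-host message totals as input.

```python
import ipaddress
from collections import Counter
from typing import Dict, List, Tuple


def netblock24(ip: str) -> str:
    """Return the /24 netblock containing `ip`, e.g. '198.51.100.0/24'."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def rank_hvs_badhoods(
    message_counts: Dict[str, int], threshold: int
) -> List[Tuple[str, int]]:
    """Count HVS hosts (above the threshold) per /24 and rank the blocks."""
    hvs_hosts = [ip for ip, sent in message_counts.items() if sent > threshold]
    return Counter(netblock24(ip) for ip in hvs_hosts).most_common()


# Hypothetical per-host message totals.
counts = {
    "198.51.100.7": 500,
    "198.51.100.9": 80,
    "198.51.100.20": 3,   # below the threshold, so not an HVS
    "203.0.113.5": 57,
}
ranking = rank_hvs_badhoods(counts, threshold=56)
```

Since a /24 block offers at most 254 usable host addresses, no block can score higher than 254 in such a ranking.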



5.1.4 Third Definition: Spamming BadHood Firepower 

power" of each netblock. Therefore, we identify the most spamming BadHoods
on the Internet in terms of the number of sent spam messages, not on the
number of spamming hosts. As for the two previous definitions, we have to rely
[...]
not provide all needed information. The first step is to count how many spam
messages each spammer has sent. Next, for each /24 netblock, we calculate the
total number of spam messages sent by all the spammers located in the block.
The final step is to rank the blocks according to that number.
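
The three steps above can be sketched as a short aggregation. The input is assumed to be a mapping from spamming IPv4 address to the number of spam messages it sent; the function names and sample values are hypothetical.

```python
import ipaddress
from collections import Counter
from typing import Dict, List, Tuple


def netblock24(ip: str) -> str:
    """Return the /24 netblock containing `ip`."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def firepower_ranking(message_counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Sum the spam sent by all hosts of each /24 block and rank the blocks."""
    totals: Counter = Counter()
    for ip, sent in message_counts.items():
        totals[netblock24(ip)] += sent
    return totals.most_common()


# Hypothetical totals: one block with two quiet hosts, one with a single
# very active host.
ranking = firepower_ranking(
    {"198.51.100.7": 3, "198.51.100.9": 2, "203.0.113.5": 500}
)
```

In this made-up example the block with a single host ranks first, which shows why a ranking by firepower can differ from a ranking by number of spamming hosts.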
5.1.5 Fourth Definition: All Spamming BadHoods

The goal of the last definition for Spamming BadHoods is to identify all
spamming netblocks, independently of the spammers' behavior. Therefore, we need
only the IP addresses of spammers, which allows us to include the information
provided by DNS blacklists into our analysis. From each data source, we extract
the IP addresses of the spammers. Then, all these IPs are combined and dupli[...]



5.2 Evaluated Datasets 

obtained data from DNS blacklists, mail server logs, and mail client logs from
various sources over a period of one week (April 19-26th, 2010). Next we



5.2.1 DNS Blacklists (DNSBL)

Table 5.1 shows the DNS blacklists we have obtained. We have chosen these
[...]
they are employed in production mail filters. The first one, Composite Blocking
[...]
source IP addresses of spammers are harvested. To give an idea of the size of
their spamtraps, one of the four traps they maintain has received, on average,
2S31 spams per second over a period of one year (H3). As shown in T^bleg^J 
on April 21st more than 8 million unique IP addresses were listed on CBL.
Another blacklist we have obtained is the Passive Spam Block List (PSBL)
[?]. PSBL is built using their own spamtraps, which capture around 500 thou[...]
[...]
were listed by PSBL on April 21st, 2010, as shown in Table 5.1.

The next blacklists are from UCEPROTECT-Network [?]. They have three
blacklists. The Level 1 blacklist lists only single IP addresses (/32) of
spammers that have contacted their spamtraps. For example, on April 21st, 2010,
more than 3 million unique IP addresses were listed on the Level 1 blacklist.
The Level 2 blacklist, on the other hand, is automatically generated based on
[...]
procedure. The following entry is present on April 21st, 2010 on the Level
2 blacklist: "86.99.128.0/17 is UCEPROTECT-Level2 listed because 267 abusers
are hosted by EMIRATES-INTERNET Emirates Internet/AS5384 there." Finally,
the Level 3 blacklist lists all IP addresses from an autonomous system (except
[...]
"minimum of 0.2% of all IPs allocated to this ASN got Level 1 listed within the
last 7 days" [?].

Finally, the Spamhaus Block List (SBL) [?] lists IP addresses at different
[...]
of electronic mail" [?]. SBL contains single IP addresses as well as entire
network blocks. As can be seen in Table 5.1, more than 10 thousand entries are
present on SBL on April 21st, 2010.

Since blacklists are usually built using a large number of honeypots, they
have a higher probability of being reached by many different spammers than a
single mail server. For example, UT/EWI mail servers have been spammed by 71,754
different IP addresses on April 21st, 2010, while CBL spamtraps list more than
8 million unique IP addresses. However, DNS blacklists list only the IP addresses
of spammers, while mail server logs list every single spam message - which



5.2.2 Mail Servers Logs 

Table 5.2 shows the mail servers from which we have obtained data. Provider
[...]
messages from more than 1.5 million different IP addresses were tagged as spam
for the monitoring period (1 week).

Next, we have analyzed data from the mail server of the Electrical Engineering,
Mathematics, and Computer Science Faculty at the University of Twente
(UT/EWI). In total, more than 1.7 million messages were logged.

Since we did not want to limit this research to data from mail servers from
the Netherlands, we have also obtained data from the mail servers of the
Security Incident Response Team of the Brazilian Research Network (CAIS/RNP)
11441 . More than SO thousand spani messages were ohtained for the monitor- 
France, denoted as Provider B, from which we got 1,160 spam messages from



5.2.3 Mail Client Logs 

Finally, the last type of data collected was mail client spam logs. For this work
we have obtained 1,321 spam messages from IS mail accounts from various
countries. These messages, in turn, came from 763 different senders. Since this
[...]
further employ it in our analysis.

5.3.1 LVS BadHoods

For this definition, we have combined mail server logs from four different
domains (n = 4): Provider A, UT/EWI, CAIS/RNP, and Provider B. By combining
those log files, we increase the chances of observing the same spammer on
different mail servers, which allows us to better classify it. The logs cover a
period of seven days (d = 7).

Table 5.3 shows the distribution of the number of spam messages per spamming
host for the combined mail server logs. For a given number i of spam
[...]
(second column) that match it, followed by the total number of spams sent by
those spammers (third column) and the average number of spams per spammer
(fourth column). The numbers in parentheses give the percentages of the total
number of spammers and spam, respectively, found in the dataset.

domain per day, as described in Section 5.1.2. Hence, Equation 5.1 leads to
[...]
most 56 messages are classified as LVS while the others are HVS. By employing
θ = 56, one can observe that most of the spammers (99.3%) are classified as LVS
(first to third rows in Table 5.3) and that they are responsible for around 80%
of all spam our mail servers have received. Since most of LVS are believed to
[...]
spam nowadays comes from botnets. In addition, we observe that nearly 50% of
all spammers have only sent one message (first row), which confirms the tactic
adopted by LVS to spam at very low volume to avoid being detected.

Figure 5.1 shows the distribution of LVS BadHoods over the IP address space.
The x-axis gives the /8 prefix of the IP address; the y-axis gives the number of
spamming LVS hosts per /24 block. Each point in the plot stands for one
netblock. The horizontal line shows the maximum possible number of hosts in a
block, that is 254. We observe that there is a high concentration of spamming
hosts on certain ranges, such as between 60-100, 110-125, and finally 180-200.



Since most of LVS are believed to be bots, these results show how some ISPs
neglect the propagation of bots and the malicious activities that the hosts in
their networks carry out. This also confirms the fact that some DNS blacklists
block entire /24 netblocks, as SBL [?].

As explained in Section 5.1.3, we can assume that the most malicious LVS
BadHoods are also the worst protected and, consequently, the most infected
ones. The presented results can be used by providers to raise awareness about
[...]
[...]hoods can also be employed to track and detect botnets [?].



5.3.2 HVS BadHoods 

HVS BadHoods are determined in a similar way as the LVS BadHoods. We 

■lSble|s]^ only 13,081 (0,70%) IP addresses have been classified as HVS. The 
distribution of the HVS over the IP address space is visualized hi Figure |0| 
Each point gives the number of HVS In one /24 block. We observe that most 
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BadHoods host less than three HVS. Remarkably, some blocks contain up to 27 
HVS - which is far less than for LVS cases, as shown in Table |e7| 

In the rightmost column of Table 5.4 we show the top 20 "spam-friendly"
providers. Differently from LVS BadHoods, we can find providers for HVS from
Europe, Africa, Asia, Russia, and South America among the top 20. Even though
the European Union has a directive that regulates spam [?], each member state
is responsible for "taking appropriate measures to ensure that [...] unsolicited
communications for purposes of direct marketing [...] are not allowed either
without the consent of the subscribers". Our results show that 5 of the top 20 HVS
BadHoods are located within the EU borders, which raises doubts on the effec[...]

Another interesting fact to observe is that Yahoo! Europe ranks number 8 in
the Top HVS list. Checking manually the 10 IPs, we found out they are, in fact,
mail servers from Yahoo! Mail located in the UK. This might be due to account
hijacking, in which spammers hijack legitimate accounts to send spam [145][146].

To conclude, HVS BadHoods show in which blocks HVS are located, thus
allowing us to identify "spam-friendly" providers. However, a complete analysis
on that is provided in Chapter [?]. These results can be used to raise awareness

Figure 5.3: Spamming BadHoods Firepower



accounts. This could be used by mail filters to appropriately rank mail from
such netblocks and by ISPs.



5.3.3 Spamming BadHood Firepower 

So far, we have analyzed BadHoods based on the number of spamming hosts per
[...]
i.e., their impact measured in number of spam messages they have sent. Again,
we rely on the mail server logs to calculate the total number of spams sent per
/24 netblock. The result is shown in Figure 5.3. Each point represents one /24
block. Note the logarithmic scale of the y-axis.

In the beginning of this chapter, we have raised the question whether the
Spamming BadHoods with most spammers are also responsible for most of the
spam. In Figure 5.4 we show the number of spams sent by a /24 block as func[...]



5.3 Experimental results

Figure 5.5: Spam CDF



As a matter of fact, these results show the strength of the Bad Neighborhood
concept. In Table 5.3 we observed that 46.94% of spammers (/32 hosts) generate
only 9.S% of the total amount of spam - which poses a major challenge for
DNS blacklist-based spam detection. However, by employing the Bad
Neighborhood concept (/24 netblocks), we are able to invert this situation: we
found that 10% of Spamming BadHoods were responsible for S4.B7% of all spam.
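
The kind of statistic quoted here, the share of all spam produced by the most active fraction of BadHoods, can be read off a cumulative distribution. The sketch below computes it directly from per-block totals; the function name and the sample numbers are hypothetical and do not reproduce the measured values.

```python
from typing import Dict


def share_of_top_blocks(firepower: Dict[str, int], percent: int) -> float:
    """Fraction of all spam sent by the top `percent`% of blocks by firepower."""
    if not firepower or not 0 < percent <= 100:
        raise ValueError("need at least one block and 0 < percent <= 100")
    totals = sorted(firepower.values(), reverse=True)
    # Round the number of blocks up, and always include at least one block.
    k = max(1, (len(totals) * percent + 99) // 100)
    return sum(totals[:k]) / sum(totals)


# Hypothetical firepower: one very active block and nine quiet ones.
blocks = {f"block-{i}": 10 for i in range(9)}
blocks["block-9"] = 910
top10_share = share_of_top_blocks(blocks, percent=10)
```

In this made-up data set, one block out of ten accounts for 910 of 1,000 messages, illustrating how a small fraction of netblocks can dominate the total.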

Given the threshold of θ = 56 spam messages, when ranking BadHoods
according to their firepower, many of them that have a small number of
spammers must, therefore, contain HVS. In fact, among the top 10 worst Spamming
[...]
one has nine, and the last one has 32 spamming hosts.

This is also shown in Figure 5.4. We can see that the most severe Spamming
BadHoods are, in terms of number of spam messages sent (or firepower), HVS
BadHoods. However, most of BadHoods are classified as LVS. Even though the
HVS BadHoods are a minority, what we can learn from these results is not to
underestimate the HVS BadHoods firepower and the damage that they can
cause. On the other hand, the average firepower of the individual LVS BadHoods
is far lower. LVS BadHoods become powerful through their sheer number.




Figure 5.6: All Spamming BadHoods



5.3.4 All Spamming BadHoods 

As explained in Section 5.1.5, the goal of the fourth definition for Spamming
BadHoods is to identify all spamming netblocks, independently of the
spammers' behavior. For our analysis, we used the data from the /32 blacklists
(CBL, PSBL, UCEPROTECT-1) and the mail server logs (Provider A, UT, CAIS/RNP,
Provider B) covering the period of April 19-26, 2010. This resulted in a list of
more than 124 million entries with IS million unique IP addresses. We
aggregated the data by counting the number of spammers for each /24 block. By
[...]
Figure 5.6 shows the distribution of the spammers over the IP addressing space.
Each point gives the number of spammers of one /24 block. The x-axis specifies
the /8 prefix of the blocks.
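
The aggregation described above (merge the spammer addresses of all sources, drop duplicates, count spammers per /24 block) can be sketched as follows. The function names are hypothetical, and the two small input lists merely stand in for a blacklist and a mail server log.

```python
import ipaddress
from collections import Counter
from typing import Iterable


def netblock24(ip: str) -> str:
    """Return the /24 netblock containing `ip`."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def all_spamming_badhoods(*sources: Iterable[str]) -> Counter:
    """Merge all sources, remove duplicate IPs, count spammers per /24."""
    unique_ips = set()
    for source in sources:
        # Normalizing through IPv4Address makes equal addresses compare equal.
        unique_ips.update(str(ipaddress.IPv4Address(ip)) for ip in source)
    return Counter(netblock24(ip) for ip in unique_ips)


# Hypothetical inputs: one address appears in both sources.
blacklist = ["198.51.100.7", "203.0.113.5"]
mail_log = ["198.51.100.7", "198.51.100.9"]
badhoods = all_spamming_badhoods(blacklist, mail_log)
```

The number of keys in the result is the number of Spamming BadHoods found, which is the quantity compared between data sources in the remainder of this section.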

Our main motivation for this definition is the question of how much data is
actually needed to identify all Spamming BadHoods. Comparing Figure 5.6 with
Figure 5.1, which has been generated only using mail server logs, we observe a
[...]
with high, respectively low, activity. In total, mail server logs have allowed us to
identify 571,389 Spamming BadHoods, while DNS blacklists combined together
with mail server logs allowed us to detect 1,205,932 Spamming BadHoods. The
difference of 634,543 between the two shows how many extra BadHoods have
been identified by using the additional information from DNS blacklists.

However, we should put this into perspective: the blacklists we have used in
our analysis have provided more than 115 million entries, while the mail servers
have provided us only 8.7 million (a factor of 13). But, by using the blacklists,
we were able to identify only 2.11 times more BadHoods. We can conclude that
[...]
well when finding Spamming BadHoods.
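
The BadHood figures quoted in this comparison are mutually consistent, which the following short check makes explicit: 571,389 plus the stated difference of 634,543 gives 1,205,932, and the ratio of the two totals rounds to 2.11. The variable names are illustrative; only the BadHood counts are used, since they are the values that can be cross-checked against each other.

```python
# Spamming BadHoods found with mail server logs only, and with DNS
# blacklists added to the mail server logs.
badhoods_mail_logs = 571_389
badhoods_combined = 1_205_932

# Extra BadHoods contributed by the blacklists.
extra = badhoods_combined - badhoods_mail_logs

# How many times more BadHoods the combined data reveals.
gain = badhoods_combined / badhoods_mail_logs

print(extra)            # 634543
print(round(gain, 2))   # 2.11
```

Set against the roughly thirteenfold increase in input entries, a gain of about 2.11 in identified BadHoods is what motivates the remark about putting the result into perspective.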







spamming hosts per /24 prefix. The resulting count was then transformed into
a score for the /24 netblock. Together with other data, they employed this score
to determine whether an e-mail would be spam or not, based on its sender's IP.
If the message originated from a "bad neighborhood", i.e., from a /24 netblock
[...]
never observed as spammer before. Our work goes beyond this work, by
defining and analyzing four types of Spamming BadHoods. The results provided in



5.5 Conclusions 

[...] presented in Chapter [?], using as a case study Spamming BadHoods. We have
raised four research questions that led us to four definitions for Spamming
BadHoods. RQ 5.1 was "What are the worst protected netblocks". We have defined
[...]
that some ISPs seem to completely neglect the malware propagation in their
networks. We can use these results to raise awareness on the most infected

RQ 5.2 focused on "what are the most spam-friendly providers". We have
[...]
providers. The presented results can be used to raise awareness about those ISPs
[...]
2002/58/EC [?]. The HVS information could be used by mail filters to block, or
at least appropriately rank, mails from such BadHoods.

RQ 5.3 was whether "Spamming BadHoods with many spammers
also send many spam messages?". Our analysis revealed that this is not the case.
In fact, the top 10 Spamming BadHoods had no more than 32 spamming hosts
(and 6 of them had only one, including the most active BadHood). In addition,
we have shown that most spam comes from a fraction of all BadHoods. The
lesson we can learn is that we should not underestimate HVS BadHood
firepower, and that the list of HVS providers should be used to raise awareness

Finally, RQ 5.4 addressed "how much data do we need to find Spamming
BadHoods". We have shown that DNS blacklists help to obtain twice as many
BadHoods as when only relying on our mail server logs. However, they have
listed 13 [...] though blacklists provide more [...]
BadHoods.

For more than 15 years, the Internet community has been fighting spam, and
the problem is still far from being solved. In this chapter, we have provided an
insight into Spamming BadHoods, taking into account spammers' behavior and
firepower. The definitions and results presented can be used to refine current
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Part III 



Defending Against Bad 
Neighborhoods 




CHAPTER 6 



Bad Neighborhood Blacklists from other
Sources



INTERNET blacklists (lists of malicious /32 IP addresses) can be obtained from
public and peer sources. Public sources are those that generate blacklists

make them publicly available on the Internet. For example, the Passive Spam
Block List (PSBL) (29 is a public spam blacklist, among many available on the
Internet. Peer sources, on the other hand, are sources that generate blacklists

a blacklist (/32) is aggregated into a bad neighborhood, we refer to that as a Bad
Neighborhood blacklist (BadHood blacklist).

ager can rely on others' BadHood blacklists to secure a host that he/she maintains

refers to external sources from which we can obtain blacklists. 

BadHoods attacking the (to be secured) target are also listed on BadHood
blacklists from other sources, the network administrator could then effectively
employ such blacklists to feed BadHood-based defense mechanisms - such as

target's own blacklist (target worst offender list (TWOL)), since blacklists ob-
tained from other sources could be employed. Since attackers may check the
IP addresses under their control against publicly available blacklists - and re-

locally generated blacklists improved detection rates. In the context of this dis-

niles employed by van Wanrooij and Pras ED- ' ^ 

Therefore, the main research question investigated in this chapter is the
following: how much can a network administrator rely on BadHood blacklists
obtained from other sources to protect a target?

One of the advantages of public blacklists is that they are usually gener-

which increases the probability of blacklisting more sources. Peer source black-
lists, on the other hand, are usually generated based on the incoming traffic to
one or a few targets. However, there are some disadvantages of employing black-
lists from public sources. The main one is the fact that the dependability of the
security solution designed to protect the target is put at stake, since it relies

blacklist sources might fail for various reasons - a disruption in the service can

due to bad weather conditions the public source might become a victim of
DDoS attacks, or change their business model and charge for access, or even

endorsed by a private agreement.

Taking into account the source of the blacklists, we divide the previously
raised research question into two sub-questions:

1. RQ 6.1: How much can a network administrator rely on public BadHood
blacklists to protect a target?

2. RQ 6.2: How much can a network administrator rely on peer BadHood
blacklists to protect a target?

Figure 6.1 presents a summary of the usage of both public and peer sources
for the protection of a target. In this figure, the target is protected from the Internet
by a BadHood-based firewall, which employs public or peer sources as input.

To answer the research sub-questions, we first identify both public and peer
sources from which we can obtain blacklists. Next, for each source, we obtain
daily blacklists for the same monitoring period (1 week). Then, we compare the
public and peer BadHood blacklists to the target worst offender list (TWOL), as
shown in Figure 6.1. The idea is to determine if the attacks occurred in a similar
way - meaning the same number of attacking BadHoods, the same instances of
these BadHoods, and the same intensity. This, ultimately, allows us to answer

The rest of this chapter is organized as follows: in Section 6.1 we present
the public and peer sources from which we have obtained blacklists. After that,
we introduce in Section 6.2 the methods employed to compare the BadHood
blacklists. Next, we address RQ 6.1 in Section 6.3, in which our results and
analysis are detailed. Later, RQ 6.2 is investigated in Section 6.4. Finally, the
concluding remarks are presented in Section 6.5.



6.1 Blacklist Sources 

In Section 6.1.1 we present the public sources from which we have obtained
blacklists. After that, in Section 6.1.2 we present our peer blacklist sources,

the target TWOL to both public and peer BadHood blacklists. Finally, in Section
6.1.4 we describe the measurement period and how we preprocess the data
before carrying out the analysis.



6.1.1 Public Blacklist Sources

There are many blacklists available on the Internet. Some web sites, like Unified
eMail 111 501 . for example, even provide an interface that allows one to query a
single IP address against more than two hundred different spam blacklists at

We therefore have to choose which blacklists to evaluate. To do so, we have
employed the following criteria:

1. Monitored applications: the most popular type of blacklists are the spam
SSH, so we can determine if the same results hold. 

2. Prior usage in the academic and/or Internet security communities:
in order to filter out blacklists of questionable reputation, we only consider
blacklists that have been employed by Internet security systems and/or 
previously investigated by the research community. 

3. Method of access: we only evaluated blacklists that could be obtained as
a single bulk file. The method of access to blacklists varies, but mostly it
is provided in a DNS-like fashion El, in which one queries the blacklist
server whether a certain IP is blacklisted or not. In our case, blacklists that only
provide DNS-like access are disregarded because, by this means, we are
not able to effectively obtain all the blacklisted IP addresses.
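
To make the DNS-like access in the third criterion concrete: in the conventional DNSBL scheme, a client reverses the octets of the IPv4 address, appends the zone of the blacklist, and issues a DNS query for the resulting name. The Python sketch below only builds that name and performs no network access. The zone `dnsbl.example.org` and the function name are hypothetical and are not taken from any of the blacklists evaluated here.

```python
import ipaddress


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name a client would look up for one IPv4 address."""
    # Validate the address first; IPv4Address raises ValueError on bad input.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    # DNSBL convention: reversed octets followed by the blacklist zone.
    return ".".join(reversed(octets)) + "." + zone


# Hypothetical zone, used for illustration only.
name = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
print(name)
```

Since each lookup answers for a single address, enumerating a whole blacklist this way would require one query per candidate address, which is consistent with the decision to consider only blacklists available as bulk files.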

• Composite Block List (CBL): CBL is operated by "a group of computer
security, spam and virus professionals, dedicated to developing and main-
taining an anti-spam and anti-virus DNSBL (DNS blacklist) of the highest
possible quality and reliability, that large organizations can use with con-
fidence" EQ). It lists /32 IP addresses that have reached their spamtraps.
The number of traps and their location is not disclosed, but they are distributed
over different networks and countries. CBL has been employed in a num-
ber of studies, including Il6ai^l64ll65ll66l.

• The Passive Spam Block List (PSBL) 123: like CBL, PSBL also maintains a
blacklist of /32 IP addresses that have spammed PSBL's distributed traps.
They do not provide more details about their infrastructure and the orga-
nization behind it. PSBL has also been investigated by our research group
in two papers ISsllSfil .

• DShield.org (DShield) 11511 : DShield is a community shared firewall log
system. Volunteers submit their firewall logs from more than 600 contrib-
utors, which encompasses more than "500,000 IP addresses (firewalls) in
over 50 countries" ESJ- It is maintained by the SANS Institute H^,
and contains security logs from many applications. As for Spam blacklists,
the blacklisted IP addresses are aggregated from many different sources.
The DShield dataset has been investigated by the research community in
papers including (155 EHIHSJ .

• SSH Blacklist (SSHBL): SSHBL maintains "19 hosts distributed over Eu-
rope, US, Australia, and China, and receives an average of 90 SSH brute-
force attacks per day" <I55I . Like the other public sources, SSHBL also does
not provide more information about its infrastructure.



6.1.2 Peer Blacklist Sources 

Differently from the public sources, the peer sources for blacklists are those
with which one might have private agreements to share data. We have obtained
blacklists from the following sources:

• Provider A: Provider A is a major hosting provider in the Netherlands. We
have obtained spam blacklists generated after processing their mail filter
log files. As specified in the private agreement, we are not allowed to

• University of Twente/EWI (UT/EWI) OS: As for Provider A, we have
obtained the IP addresses of spammers that have reached the mail server
of the Electrical Engineering, Mathematics, and Computer Science Faculty
of the University of Twente, in the Netherlands.

• Security Incident Response Team of the Brazilian Research Network ICTI
(CAIS/RNP): Differently from the previous sources, this one is located in
Brazil. We have obtained the malicious IP addresses from their mail server
log files.

• QuarantaineNet (QNET) EM'- QuarantaineNet is a Dutch company that

control and malware control to their customers. They maintain a honey-
pot infrastructure with 125 traps distributed mostly over the Netherlands.
In this data set, each individual trap is seen as a single blacklist source.
The monitored types of attack include SSH, MySQL, and Windows vulner-
abilities.

6.1.3 Target Blacklist Sources

we need as input (i) public and peer BadHood blacklists, and (ii) targets to be
protected. These targets should be, preferably, real world production servers,
to reflect what is observed on the Internet. In addition, such targets should be
application servers or honeypots for at least one of the applications for which
we have obtained blacklists.

In this chapter, we have considered situations in which the network manager
of Provider A, UT/EWI, and CAIS/RNP tries to protect his/her targets. For

honeypot infrastructure.

For each target, we generated a TWOL blacklist based on the history of in-
coming traffic (in the case of honeypots) or on the application server log files
(for the mail servers). Then, we compare it with the public sources and the

generated from the same host.

6.1.4 Blacklists Collection and Pre-processing

For all blacklist sources, we have chosen a common monitoring period. This

For all sources, we have collected data for a period of one week, which is long
enough to observe a significant number of events. In addition, in this chapter we
only compare blacklists that belong to the same application (we compare blacklists
from multiple applications in Chapter 7).

For the Spam experiments, the monitoring period was from April 19th to
April 25th, 2010. For the SSH experiments, the monitoring period was also
one week, but from November 11th to 18th, 2011. Even though we have two

compare blacklists belonging to the same time frame.

After obtaining the blacklists, we have to parse them (using customized Java
software and Linux shell tools) and convert them to a common BadHood black-
list format, expressed by the tuple: (/24 netblock, # of Hosts). In this format,
/24 netblock refers to the /24 IP address of the BadHood and # of Hosts refers to

Hosts < 255).
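
The conversion to the common BadHood format can be sketched in a few lines. The sketch below is written in Python for brevity and is not the Java software mentioned above; the function name and the sample addresses (taken from documentation address ranges) are assumptions made for illustration. It maps every /32 address to its enclosing /24 netblock and counts the distinct hosts seen in each one.

```python
import ipaddress
from collections import defaultdict


def aggregate_to_badhoods(ip_blacklist):
    """Turn a list of /32 addresses into {"/24 netblock": number of hosts}."""
    hosts_per_block = defaultdict(set)
    for raw in ip_blacklist:
        addr = ipaddress.IPv4Address(raw.strip())
        # strict=False lets ip_network() mask the host bits of the address.
        block = ipaddress.ip_network(str(addr) + "/24", strict=False)
        # A set ensures an address listed twice is counted once.
        hosts_per_block[str(block)].add(addr)
    return {block: len(hosts) for block, hosts in hosts_per_block.items()}


# Hypothetical sample blacklist, for illustration only.
sample = ["192.0.2.10", "192.0.2.77", "192.0.2.10", "198.51.100.5"]
badhoods = aggregate_to_badhoods(sample)
print(badhoods)
```

Counting distinct addresses rather than list entries is a design choice of this sketch; it keeps duplicated entries in a daily blacklist from inflating the host count of a BadHood.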
In Chapter 3 we have addressed the issue of the BadHood granularity -
and how to aggregate BadHoods to smaller prefixes. As we have presented,
the aggregation to prefixes smaller than /24 incurs error on the odds of a host
belonging to a certain netblock being malicious or not. We have therefore chosen

After aggregating the original blacklists to /24 BadHood blacklists, we can
proceed with the BadHood blacklists analysis, employing the comparison meth-



6.2 BadHood Blacklist Comparison Methods 

The idea behind comparing BadHood blacklists is to tell how similarly attacks on
the Internet are perceived by distinct blacklist sources. Consider as an example
the case of Figure 6.2. In this figure, four blacklist sources (1-4) are attacked
by different BadHoods (outer circles), with each attack represented by a line
connecting a BadHood to a blacklist source. As can be observed, not all the
BadHoods attack the same sources, and some BadHoods attack only one or two

In order to answer RQ 6.1 and RQ 6.2, we therefore compare how two
different blacklist sources experience the attacks, by answering the following

1. Are two different blacklist sources attacked by the same number of BadHoods
(that is, the same number of outer circles in Figure 6.2)?

2. Are two different blacklist sources attacked by the same BadHoods (that
is, exactly the same outer circles in Figure 6.2)?

3. If a BadHood is found attacking two different blacklist sources, is this
attack carried out with the same number of hosts?



6.2.1 First Method: BadHoods Distribution 

The first comparison method focuses on analyzing the BadHoods distribution in
each BadHood blacklist. By applying this method to two different blacklists,
we can tell if they are attacked by the same number of BadHoods. We therefore
compute the following metrics:

• number of source BadHoods (# of distinct /24)

• minimum number of malicious hosts per BadHood (min)

• maximum number of malicious hosts per BadHood (max)

• mean number of malicious hosts per BadHood (mean)

• standard deviation of the number of malicious hosts per BadHood (sdev)

statistical computing (Ml . 
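
The five metrics of the first method can be sketched as follows. This is an illustrative Python version and not the statistical tooling used for the results in this chapter. The function name and the sample data are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not state which estimator was applied.

```python
import statistics


def badhood_distribution(badhoods):
    """Summarise a BadHood blacklist given as {"/24 netblock": number of hosts}."""
    counts = list(badhoods.values())
    return {
        "badhoods": len(counts),           # number of distinct /24 netblocks
        "min": min(counts),                # fewest malicious hosts in a BadHood
        "max": max(counts),                # most malicious hosts in a BadHood
        "mean": statistics.mean(counts),
        # Assumption: sample standard deviation; undefined for a single entry.
        "sdev": statistics.stdev(counts) if len(counts) > 1 else 0.0,
    }


# Hypothetical BadHood blacklist, for illustration only.
example = {"192.0.2.0/24": 2, "198.51.100.0/24": 1, "203.0.113.0/24": 6}
metrics = badhood_distribution(example)
print(metrics)
```

Running the same function on two blacklists and comparing the resulting dictionaries gives the side-by-side view that the first method relies on.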



6.2.2 Second Method: Intersecting BadHoods

The second method focuses on telling if two different blacklist sources are
attacked by the same BadHoods. To answer this question, we perform an inter-
section operation between the BadHood sets of each source, as shown in Figure
6.3 for two sources (s1 ∩ s2).
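
The intersection operation of the second method can be sketched with plain set operations. The Python sketch below assumes that each BadHood blacklist is held as a dictionary keyed by /24 netblock; the function name and the sample data are hypothetical. It returns the netblocks present in both blacklists and the percentage of the target's BadHoods that the other source also lists.

```python
def intersect_badhoods(target, source):
    """Return the shared /24 netblocks and the share of the target they cover."""
    # Dictionary key views support set intersection directly.
    matching = target.keys() & source.keys()
    coverage = 100.0 * len(matching) / len(target) if target else 0.0
    return matching, coverage


# Hypothetical blacklists as {"/24 netblock": number of hosts}.
target = {"192.0.2.0/24": 1, "198.51.100.0/24": 3}
source = {"192.0.2.0/24": 4, "203.0.113.0/24": 2}
matching, coverage = intersect_badhoods(target, source)
print(sorted(matching), coverage)
```

Expressing the result relative to the target, rather than to the source, matches the question a network administrator would ask: how many of the BadHoods attacking my host does the external blacklist already contain.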
ticular BadHood employed to attack two different targets. We calculate

analyze the distribution of delta values.

If most of the Δ values are equal to zero, we can conclude that most of the match-
ing BadHoods attack both targets using the same number of hosts. If most of the
Δ values are larger than zero, then the BadHoods attack the first target (D1

To calculate the Δ values, we have employed a small program written in the Java
programming language.
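
A minimal sketch of the delta computation is given below. It is written in Python and is not the Java program referred to above. It assumes that, for each BadHood found in both data sets, the delta value is the number of hosts observed by the first data set minus the number observed by the second, which is consistent with the interpretation of zero and positive values given in the text. The function name and the sample data are hypothetical.

```python
from collections import Counter


def delta_distribution(d1, d2):
    """Per-BadHood host differences (d1 minus d2) and their histogram."""
    # Only BadHoods present in both data sets have a defined delta.
    deltas = {block: d1[block] - d2[block] for block in d1.keys() & d2.keys()}
    # The histogram maps each delta value to its number of occurrences.
    return deltas, Counter(deltas.values())


# Hypothetical data sets as {"/24 netblock": number of hosts}.
d1 = {"192.0.2.0/24": 4, "198.51.100.0/24": 1, "203.0.113.0/24": 2}
d2 = {"192.0.2.0/24": 1, "198.51.100.0/24": 1}
deltas, histogram = delta_distribution(d1, d2)
print(deltas, histogram)
```

The histogram is the quantity one would plot to inspect whether the delta values cluster around zero or lean toward one of the two data sets.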



6.3 Public BadHood Blacklists Evaluation 

In this section we investigate RQ 6.1: "How much can a network administrator
rely on public BadHood blacklists to protect a target?" As described in Section
6.1.4, we have obtained various blacklists and then aggregated them into /24
BadHoods. After that, we have employed the comparison criteria to answer our
research questions. In Section 6.3.1 we present the results and analysis for the
first comparison method (BadHoods Distribution), whereas in Section 6.3.2 we
show the results for the second comparison method (BadHoods Intersection),
and, finally, in Section 6.3.3 we show the results for the BadHoods correlation.

6.3.1 Method # 1: BadHoods Distribution

In this section we present the results of the BadHood distributions. We first
present results for Spam and then for SSH blacklists.



Spam BadHoods 

When comparing BadHood blacklists from public sources to targets, one could
expect that public sources are likely to have a significantly higher number of
BadHoods than individual targets, since public sources typically aggregate data
from multiple hosts to generate a single blacklist. The intuition is that more
monitoring hosts increase the chances of observing attacks from different hosts,
ultimately increasing the total number of observed BadHoods. Thus, public
sources should be able to capture a significantly higher number of BadHoods than
individual targets.

Table 6.1 shows the results for Spam BadHoods. In each column, we have
the BadHood blacklists obtained from the public sources and the targets, and in
each line we have the results for the metrics described in Section 6.2.1.

Analyzing the number of BadHoods in the table, we see that, with the ex-

targets' blacklists is relatively not so big. For example, the ratio between CBL
and Provider A - data sets that have more entries in each category - is roughly
2. Comparing CBL to the second target in terms of entries (UT/EWI), this ratio
is 4.5. The difference is more significant when comparing CAIS/RNP, a small

In the case of DShield-Spam, which is a public source, we can observe that
it was attacked by a smaller number of BadHoods in comparison to the targets
Provider A and UT/EWI. The reason for that might be that, for
Provider A and UT/EWI, the blacklists are generated based on the mail server
logs, while DShield blacklists, on the other hand, are generated based on fire-

We can observe that the ratio between the number of BadHoods observed

DShield-Spam and CAIS/RNP). An interesting observation is that large data
sets, like CBL, have observed 1,140,005 /24 BadHoods out of the roughly 16
million theoretical maximum. This is a revealing number: it means that at least
6.79% of all /24 neighborhoods on the Internet are involved in Spam. Provider
A, on the other hand, as a single target, has been attacked by 3.2% of all neigh-
borhoods on the Internet.
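
The percentage quoted for CBL follows directly from the size of the IPv4 address space. The short Python check below assumes that the theoretical maximum is the full set of 2^24 = 16,777,216 possible /24 netblocks, without excluding reserved or unallocated ranges; the constant and function names are introduced here for illustration only.

```python
# Assumption: every possible /24 in the 32-bit IPv4 space counts as a candidate.
TOTAL_SLASH24 = 2 ** 24


def share_of_slash24_space(observed_badhoods: int) -> float:
    """Percentage of all possible /24 netblocks covered by a blacklist."""
    return 100.0 * observed_badhoods / TOTAL_SLASH24


# 1,140,005 is the number of CBL BadHoods quoted in the text.
cbl_share = share_of_slash24_space(1_140_005)
print(f"{cbl_share:.2f}%")
```

Restricting the denominator to allocated or routed address space would make the share larger, so the figure obtained this way is best read as a lower bound.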

the number of BadHoods and the mean number of hosts per neighborhood:
average, the source observes (Pearson correlation coefficient EM of P = U.M).



DShield-Spam is a subset of DShield - only attacks on TCP ports 25 and

SSH BadHoods

Table 6.2 shows the results for SSH BadHoods. For this application, we have
used SSH-BL and DShield-SSH (a subset of the DShield dataset, listing IP addresses
that have attacked port 22). The targets, on the other hand, were obtained
from QNET blacklists. For the monitoring period, 16 of their 125 honeypots
have observed SSH attacks. Out of those, we have chosen the top five in terms
of number of attacks (QNET-1 - QNET-5) and performed our analysis.

As for Spamming BadHoods, we can observe that for SSH, public sources
also observe more BadHoods than the targets themselves. However, the difference

of 160.3 when dividing the number of BadHoods of SSH-BL by the number of
BadHoods of QNET-1.

In addition, the total number of BadHoods observed for SSH is also very low
compared to those observed for Spam. The reason for that is that SSH attacks
are far less common than Spam. Therefore, within the same monitoring period

single host per /24 BadHood. As expected, public sources detect many more
BadHoods than targets. We compare BadHoods from different applications in
more detail in Chapter 7.

For both Spam and SSH BadHoods, we can conclude that public sources are
more likely to observe more BadHoods than targets due to the larger number and
distribution of their monitoring probes. This behavior, in turn, strengthens the
idea that public BadHood blacklists can be employed to protect targets on the
Internet.







6.3.2 Method # 2: BadHoods Intersection 

In Section 6.3.1 we have shown that public sources are likely to observe more
BadHoods than the targets. However, we wonder if this implies that the target

blacklists. This leads us to our second comparison method. First we present the
results for Spam BadHoods and then for SSH BadHoods.



For Spam BadHoods, as observed in Table 6.1, the CBL BadHood blacklist comprises
6.79% of the maximum theoretical /24 prefixes, while Provider A covers 3.2% of
this value. We might expect a significant intersection between these two sources
(and the others as well). Otherwise, the implications would be alarming: if both
sources are attacked by distinct BadHood sets, that would mean that at least
10% of all /24 neighborhoods on the Internet are involved in malicious activity.
Not only that, if the same behavior holds for other blacklist sources, we could
end up having the majority of the Internet neighborhoods (/24) being classified
as "bad".

The results of the intersection between the Spam blacklist sources are shown in
Table 6.3. In this table, we show the percentage of BadHoods from each target
(in rows) captured by the public sources (columns) - (target ∩ public) - and the
absolute number in parentheses. With the exception of DShield-Spam, the public
sources, indeed, capture most of the BadHoods that attack individual targets
(from 88.03% to 98.74%). From the point of view of the network administrator,

spam blacklist sources to protect the network.



Irrelevant Entries 

there is a drawback: even though CBL captures 98.74% of Provider A BadHoods
(Table 6.3), CBL still has 598,038 BadHoods that did not spam Provider A mail
servers - which is twice the size of Provider A BadHood blacklist, as shown in
Table 6.4. Such BadHoods are actually irrelevant to the targets - they represent
BadHoods that have not attacked the target in the monitoring period.

gested by Zhang et al., in which they say that GWOL lists such as PSBL and
CBL "have the potential to exhaust the subscribers' firewall filter sets with ad-
dresses that will simply never be encountered" 11541 This may incur a problem

switches and routers (the CBL non-matching list in relation to Provider A requires
12MB of storage in plain-text format, but lookup time also plays a role). The

could aggregate the BadHood blacklists into smaller prefixes (e.g., /23, /22), as



Understanding the Intersection

Figure 6.4 shows the intersection between CBL and Provider A. As can be seen,
Provider A is almost a subset of CBL (CBL ∩ A, as shown in Table 6.3). How-
ever, as shown in Table 6.4, CBL was attacked by 598,038 BadHoods that did
not attack Provider A (CBL - (CBL ∩ A)). This number represents 52.45% of all
BadHoods attacking CBL.
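
The share of CBL entries that are irrelevant to Provider A can be re-derived from the two counts quoted in this chapter: the 1,140,005 BadHoods observed by CBL and the 598,038 of them that did not attack Provider A. The Python check below is only an arithmetic sketch; the variable names are introduced here for illustration.

```python
# Both counts are quoted in the text of this chapter.
cbl_total = 1_140_005      # BadHoods observed by CBL
cbl_not_in_a = 598_038     # CBL BadHoods that did not attack Provider A

# BadHoods seen by both CBL and Provider A, obtained by subtraction.
cbl_and_a = cbl_total - cbl_not_in_a

# Share of the CBL blacklist that is irrelevant to Provider A.
irrelevant_share = 100.0 * cbl_not_in_a / cbl_total
print(cbl_and_a, irrelevant_share)
```

The result agrees with the quoted percentage up to rounding in the second decimal place, and the subtraction gives the size of the intersection that the rest of this section analyzes.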

these two subsets (we have carried out the same analysis for UT/EWI and PSBL,
and the same conclusions hold). Intuitively, one could think that there would
be no particular reason for these distributions to differ from one another.

As shown in Section 6.2.2, each intersecting BadHood is stored in the follow-
ing tuple: (/24 netblock, # of Hosts D1, # of Hosts D2), in which # of Hosts Dn refers
to the number of hosts observed for the compared data sets. For this case, D1
is the CBL data set, while D2 is Provider A. Table 6.5 presents the results of the

number of hosts per BadHood for CBL (D1), while in the third column we show
the results for the complement part of CBL (CBL - (CBL ∩ Provider A)).

Intuitively, one would expect that the BadHoods having more malicious
hosts are more likely to attack different targets (CBL and Provider A, in this
case), since more malicious hosts would increase the attack capacity of the
BadHood. And this is exactly what we observe. Analyzing Table 6.5, we can
observe that the CBL part that intersects with Provider A has a higher average
number of hosts.

a public source are more likely to attack a target. Figures 6.5 and 6.6 show these

We believe that the reason why public sources capture, on average, more
malicious hosts per BadHood than individual targets is that they employ multi-
ple targets and aggregate data from those, increasing the probability of observing
attacks. To illustrate this, consider Figure 6.7. In this figure, target A is attacked
by only one host (4) belonging to the particular BadHood, while CBL sources
(x, y, and z) are attacked by hosts 1, 7, and 4.



SSH BadHoods 

Table 6.6 shows the results for SSH BadHoods. Even though SSH-BL lists many
more BadHoods than DShield (17,637 against 3,740, as shown in Table

reflects an interesting observation: BadHood blacklists having more entries
do not necessarily lead to more matching BadHoods for a particular target. Since
we do not know exactly the details of the infrastructure behind each data source,
we can only speculate on the reasons for this behavior - for example, it might be
the case that SSH attackers choose their targets more strategically than
spammers.

For both Spam and SSH BadHoods, we can conclude from this comparison



of hosts attacking CBL and Provider A from each BadHood. Then, we analyze
the distribution of the Δ values, as shown in Figure 6.9. In this figure, on the y
axis we show the number of occurrences - that is, the number of Δ values, in log
scale - while on the x axis we show the values for Δ.

tacked by more hosts than Provider A (only in a few cases Provider A observed
more BadHoods - Δ < 0). In addition, we can observe that most of the Δ values are
located within the interval [0-25] - that is, most of the BadHoods employed be-
tween 0 and 25 more hosts to attack CBL. We have also carried out the same analysis
for PSBL and UT/EWI, and the former results hold.

We have executed the same analysis for SSH BadHoods. However, since the
number of intersecting BadHoods is very small in relation to spam (e.g., SSH-
QNET1 has 104 BadHoods intersecting with DShield-SSH) we do not present
the graphical analysis. Out of the 104 intersecting BadHoods between SSH-
QNET1 and DShield-SSH, 95 attacked both with only one host, 2 attacked both
with 2 hosts, 6 attacked DShield-SSH with 2 hosts and SSH-QNET1 with only one,
and 1 BadHood attacked DShield-SSH with 3 hosts and SSH-QNET1 with 2 hosts.

Therefore, what we can learn from both Spam and SSH cases is that the
way a particular BadHood attacks different targets (or set of) depends on the
application. For Spam, it is more likely that BadHoods always attack public

6.4 Peer BadHood Blacklists Evaluation



6.1.3, we assume the point of view of a network administrator from Provider A,
UT/EWI, CAIS/RNP, and QuarantaineNet.

In this section, we present our analysis for both Spam and SSH BadHoods.
In Section 6.4.1 we show the results for the first comparison method, while in
Sections 6.4.2 and 6.4.3 we show the results for the second and third methods.



6.4.1 Method # 1: BadHoods Distribution



Spam BadHoods 

number of BadHoods would be similar - since, in our case, peer sources and
targets are individual hosts. Table 6.7 shows the results for the Spam BadHood
blacklists. Analyzing this table, we can observe that there is a significant differ-
ence in the number of BadHoods that attack each peer/target (# of BadHoods
(/24)): Provider A has observed 2.2 times more BadHoods than UT/EWI, and

These results shed some light on the modus operandi of spammers: they send
more spam to targets with more users. Even though we do not know the precise number
of users per source, we know that Provider A has more e-mail users, followed
by UT/EWI, while CAIS/RNP has the smallest number.

SSH BadHoods

In Section 6.3.1 we have shown that public SSH BadHood blacklists
list many more BadHoods than individual targets. Since SSH attacks are not as
frequent as Spam, one could expect that peer sources and targets consisting of
one monitoring host would observe a small but similar number of BadHoods.

This intuition is confirmed by the results shown in Table 6.8. In this ta-
ble, each target/source is an individual honeypot of the QuarantaineNet honeynet.
Each target (QNET-1 - QNET-5) is located in a different network. As can be
seen, the number of attacking BadHoods is small (24-110) and, as shown by

gets.



6.4.2 Method # 2: BadHoods Intersection 

according to the target/peer source. Even though these numbers vary, we still
want to know how many (if any) BadHoods were observed by both targets
and peer sources (peer ∩ targets). To answer this question, we determine the

Spam BadHoods

Table 6.9 shows the results for the Spam BadHoods. In this table, we show
the percentage of the target's BadHoods (rows) that were captured by the peer
sources (columns), while the absolute number is shown between parentheses.

that would be enough to detect only 41.65% of the Spamming BadHoods it

Analyzing Table 6.9, we can observe that the best results are obtained only
when UT/EWI and CAIS/RNP employ Provider A's BadHood blacklist. The rea-
son for that is also related to the number of entries each peer source observes:
as shown in Table 6.7, Provider A has observed 2.2 and 15.09 times more
BadHoods than UT/EWI and CAIS/RNP, respectively.



Table 6.10 presents the number of BadHoods that are irrelevant to the targets
- that is, they have not been observed attacking the targets, but only the peer
sources. If the target UT/EWI were to use Provider A's BadHood blacklist, it
would be able to match 91.89% of the BadHoods (as shown in Table 6.9), but
UT/EWI would not observe 319,889 BadHoods that only attacked Provider A,
as shown in Table 6.10, which is equal to 128.45% of Provider A's own observed
blacklist. The same reasoning applies to the other targets.

6.4.3 Method # 3: Correlation



The scatter plot for Spam BadHoods is shown in Figure 6.10. Each point rep-
resents an intersecting BadHood, where the tuple (x, y) represents the number
of hosts used by the BadHood to attack Provider A and UT/EWI, respectively.
The green line (Ratio=1) shows where the number of hosts is equal for both

users (as shown in Section 6.4.1), but that they use more hosts to carry out the
attacks.

Figure 6.11 shows the distribution of the difference values (Δ) between
the number of hosts attacking Provider A and UT/EWI, for the intersecting
BadHoods. As can be seen, most of the difference values are located within the inter-
val [0-25] - that is, most of the BadHoods employed at most 25 more hosts to
attack Provider A than UT/EWI, and the number of occurrences decreases as Δ
values increase. There are a few cases in which UT/EWI has observed more
attacking hosts than Provider A, but they represent a minority of the cases.

In the case of SSH BadHoods for peer sources, the number of intersecting

Taking the results from both Spam and SSH, we can conclude that, de-
pending on the application, peer sources observe different numbers of attack-
ing BadHoods, as we have observed when comparing public source blacklist
BadHoods to individual targets.




CHAPTER 7 



Bad Neighborhood Blacklists from
Different Applications



IN this chapter, our goal is to determine if different types of Internet attacks
originate from the same set of Internet BadHoods. The motivation for
conducting this study is similar to the motivation of Chapter 6: to avoid

Internet BadHoods are responsible for different types of attacks (e.g., Spam, SSH

tailored BadHood blacklists, and employ the currently available ones to protect
targets running different applications. For example, one could employ Spam
BadHood blacklists to also protect against SSH or Windows Shares attacks.

We have seen in Chapter 5 and in Chapter 6 that the size of BadHood black-
lists varies according to the application (e.g., spam blacklists usually have more
entries than their SSH and phishing counterparts). However, whether we can find a
significant number of BadHoods in different blacklists is still unclear - which is
the object of study in this chapter.

Taking this into account, we raise the following research question: "Are the
same BadHoods responsible for carrying out attacks in different applications on the

To answer this research question, we have first chosen data sets containing
IP addresses found carrying out attacks employing different applications. After
choosing the data sets, we have carried out the measurements, and obtained
a daily snapshot of each data source for a one-week period. Next, we have gen-
erated individual blacklists containing the IP addresses of the attackers, and
aggregated them into /24 BadHoods. We have chosen /24 because this is the pre-
fix that incurs less aggregation error (as discussed in Chapter 3), and it is
the smallest prefix that can be "routed" on the Internet ||^. After that, we
will compare the generated BadHood blacklists employing the same method-

different sources.

application-specific data sets we have used to generate the BadHood blacklists
for the various evaluated applications. In Section 7.2 we evaluate the data sets
employing the comparison methods described in Section 6.2. Finally, in Section



7.1 Blacklist Sources 

In this section, we present the data sets that are specific for spam (CBL,
Section 7.1.1), phishing (Phishtank, Section 7.1.2), and firewall logs covering
multiple applications (DShield, in Section 7.1.3).

For the three data sets, we have collected data for a one-week period (November
11th to 18th, 2011). Then, we have generated a single list of /32 IP addresses
for each data set. Subsequently, each blacklist was aggregated into a
/24 BadHood blacklist.
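
The aggregation step described above can be sketched as follows. This is a minimal illustration, not the tooling used in the dissertation: the function names and the sample addresses (taken from documentation ranges) are assumptions made here.

```python
import ipaddress


def to_badhood(ip: str) -> str:
    """Return the /24 prefix (BadHood) that contains an IPv4 address."""
    # strict=False lets ip_network() accept a host address and mask it down.
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def aggregate(blacklist):
    """Map each /24 BadHood to the set of distinct /32 addresses seen in it."""
    badhoods = {}
    for ip in blacklist:
        badhoods.setdefault(to_badhood(ip), set()).add(ip)
    return badhoods


# Hypothetical /32 blacklist with one duplicated entry.
sample = ["192.0.2.10", "192.0.2.77", "198.51.100.5", "192.0.2.10"]
badhoods = aggregate(sample)
```

Keeping the set of member addresses per prefix, rather than only a counter, makes it possible to derive the number of malicious hosts per BadHood later on.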

as for Chapter^ as described in Section |gTll: (i) monitored applications, (ii)

of access. We have focused on data sets that would provide lists for different
applications, that have been used in different research works, and that can be



7.1.1 Composite Block List (CBL)

As described in Chapterg, Composite Block List (CBL) is a spam blacklist, which

dedicated to developing and maintaining an anti-spam and anti-virus DNSBL of
the highest possible quality and reliability, that large organizations can use with
confidence". It lists /32 IP addresses that have reached their spamtraps.
The number of traps and their locations are not disclosed, but they are
distributed over different networks and countries. CBL has been employed in a
number of studies.

Listing 2: Sample of DShield log file




7.1.2 Phishtank 

ing websites" [106]. It provides a blacklist of URLs that contain forged websites.
Since we need IP addresses instead of URLs to proceed with our analysis, we
have obtained this blacklist and resolved all the URLs to IP addresses using
Google Public DNS. In case a URL was resolved to multiple IP addresses, we
have considered all of them.
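
A minimal sketch of the URL-to-address step is shown next. It is an assumption-laden illustration: it queries whatever resolver the operating system is configured with (not specifically Google Public DNS), and the function name and the injectable `resolver` parameter are introduced here only so the logic can be exercised without network access.

```python
import socket
from urllib.parse import urlsplit


def resolve_url(url, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses a URL's host resolves to."""
    host = urlsplit(url).hostname
    if host is None:
        return []  # no host part, nothing to resolve
    try:
        records = resolver(host, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # unresolvable host contributes no addresses
    # Each record is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (address, port).
    return sorted({record[4][0] for record in records})
```

All addresses returned for a host are kept, matching the choice of considering every IP address when a URL resolves to several.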



7.1.3 DShield 

As described in Chapter 5, DShield is a community shared firewall log
system. More than 600 contributors voluntarily submit their firewall logs,
which encompass more than "500,000 IP addresses (firewalls) in over 50
countries". It is maintained by the SANS Institute, and contains security

research community in several research works.

An additional advantage of using the DShield dataset is that it provides log
files for attacks belonging to many applications, unlike, for example,
blacklists like CBL, which only list spamming IP addresses.

Listing 2 shows a sample of a DShield log file (field names were changed to
fit in the page). As can be seen, the file is aggregated over the source IP address
of the attacker (Source IP) and destination port (Port). For example, the first
IP address in the list (62.4.71.237) has employed TCP (Proto = 6, from IANA
Internet protocol numbers) to attack port 8080 for 78,261 times (Occur.).

GMT. The date can be inferred from the file's name (DShield provides one file
per day).

We have then filtered the DShield files to keep only the information we

entries and generated unique entries {Source IP, Port Number, Protocol}.
For the monitoring period, we found 4,978,729 different entries.

Table 7.1 provides an overview of the DShield dataset. As can be seen, we
have found more than 2.8M single /32 IP addresses carrying out attacks. On
average, these individual IP addresses have misused 1.72 different applications.
On the other hand, /24 BadHoods were found misusing 2.31 applications,
on average, while having 2.29 malicious IP addresses per /24.
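
The kind of overview statistics reported above can be derived from the unique {Source IP, Port Number, Protocol} entries roughly as follows. This is only a sketch: the tuple layout, function names and sample entries are assumptions made here, and an "application" is taken to be a (port, protocol) pair.

```python
import ipaddress
from collections import defaultdict


def mean_size(groups):
    """Average number of members per group (groups must be non-empty)."""
    return sum(len(members) for members in groups.values()) / len(groups)


def overview(entries):
    """Summarize (source_ip, port, proto) tuples; duplicates are ignored."""
    unique = set(entries)
    apps_per_ip = defaultdict(set)
    apps_per_badhood = defaultdict(set)
    ips_per_badhood = defaultdict(set)
    for ip, port, proto in unique:
        badhood = str(ipaddress.ip_network(f"{ip}/24", strict=False))
        apps_per_ip[ip].add((port, proto))
        apps_per_badhood[badhood].add((port, proto))
        ips_per_badhood[badhood].add(ip)
    return {
        "unique_entries": len(unique),
        "apps_per_ip": mean_size(apps_per_ip),
        "apps_per_badhood": mean_size(apps_per_badhood),
        "ips_per_badhood": mean_size(ips_per_badhood),
    }


# Hypothetical entries; protocol 6 is TCP in the IANA numbering.
entries = [
    ("192.0.2.1", 445, 6),
    ("192.0.2.1", 80, 6),
    ("192.0.2.2", 445, 6),
    ("198.51.100.9", 25, 6),
    ("192.0.2.1", 445, 6),  # repeated entry
]
stats = overview(entries)
```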

Figure 7.1(a) shows the distribution of the number of distinct
BadHoods/Proto/Port, while Figure 7.1(b) shows its respective CDF. As can be
seen, the vast majority of /24 BadHoods (2,620,153, or 89.8%) have been observed
attacking using a single application only. These results from DShield suggest that
the majority of BadHoods are application-specific; however, in the next section,
we will investigate if these findings still hold when comparing BadHoods from
different data sources.

Since DShield provides data for more than 100 K types of applications, we
chose a subset of these for our analysis. We have ranked the most frequently
attacked applications (Port and Proto fields) in terms of number of attacking
IP addresses. Table^ shows the top 20 applications in this list, including their
description. As can be seen, most of the attacking IP addresses target
Microsoft-DS Active Directory (port 445).

In addition, many entries did not list any protocol and others used high port
numbers (unassigned). Therefore, we have focused only on attacks on the
"well-known ports" (port number < 1024, according to IANA terminology and
list) that have the protocol field (Proto) different from null. By filtering
out such entries, we filter out potential false positive entries found in the
DShield data set, and focus on the most repeated ones. Table |7^ shows the top
10 ports according to these criteria.

From the top 10 ports shown in Table |7^, we have chosen the top 5 ports
to carry out our experiments (excluding Telnet), plus a high port having most
of the attacks (5559) from Table |^. Therefore, six ports from DShield were
chosen: TCP 445 (T-445), UDP 5559 (U-5559), TCP 25 (T-25), TCP 443 (T-443),
TCP 80 (T-80), and UDP 53 (U-53).
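
The filtering and ranking just described can be sketched as below. It is a hedged illustration under assumed names: entries are (source_ip, port, proto) tuples, a missing protocol is represented by None, and ties are broken by (proto, port) only to make the output deterministic.

```python
from collections import defaultdict

WELL_KNOWN_LIMIT = 1024  # well-known ports are 0..1023


def rank_ports(entries, top=10):
    """Rank (proto, port) pairs by the number of distinct attacking IPs."""
    attackers = defaultdict(set)
    for ip, port, proto in entries:
        if proto is None or port >= WELL_KNOWN_LIMIT:
            continue  # drop entries without protocol or on high ports
        attackers[(proto, port)].add(ip)
    ranking = sorted(attackers.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(app, len(ips)) for app, ips in ranking[:top]]


# Hypothetical entries: protocol 6 is TCP, 17 is UDP.
entries = [
    ("192.0.2.1", 445, 6),
    ("192.0.2.2", 445, 6),
    ("192.0.2.3", 80, 6),
    ("192.0.2.4", 5559, 17),   # high port, filtered out by this rule
    ("192.0.2.5", 25, None),   # no protocol listed, filtered out
    ("192.0.2.1", 445, 6),     # duplicate attacker for the same port
]
top_ports = rank_ports(entries)
```

A high port such as 5559 would be excluded by this rule, which is why it has to be added back by hand when it is of interest.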



7.2 Experimental Evaluation 

scribed in Chapter 5. In Section 7.2.1, we evaluate the BadHood blacklists
according to their distribution, as described in Section |011. Then, in
Section 7.2.2, we show the BadHoods intersection, as covered in Section |5X5|.



7.2.1 BadHoods Distribution

Similar to Section |0?T1, we start by comparing the number and distribution of
malicious hosts over the IP address space for different BadHood blacklists.

Table ^ presents the results. In this table, we show, for each application,
the number of observed BadHoods and the statistics on the number of hosts
per BadHood. Analyzing this table, we can see that the number of observed
BadHoods changes considerably according to the application considered. For
example, CBL (a spam dataset) has exhibited 550 times more BadHoods than

from the same source (DShield), we can observe port 445 BadHoods (T-445,
Windows Shares) were 16.57 times more frequent than HTTP BadHoods
(T-80).



7.2.2 BadHoods Intersection

This subsection focuses on revealing the percentage of intersecting BadHoods
between two BadHood blacklists.

Table|JII shows the results. Note that, for two blacklists, we only compare
the one which has observed fewer hosts to the one which has observed more
hosts, since we want to determine the intersection of a smaller BadHood
blacklist with a bigger one. In Table [TTj we show the number of BadHoods
that were found intersecting between two application BadHood blacklists; the
percentage values refer to the total number of matching BadHoods divided by
the number of entries observed by the line source. As an example, consider
the second row and second column. It is to be interpreted as follows: of all
BadHoods that have attacked using UDP port 5559 (U-5559), 29.8% were also
found attacking the TCP 445 application (T-445).
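
A minimal sketch of this intersection measure follows. The names are assumptions made here, and for simplicity the "smaller" blacklist is chosen by number of BadHood entries, which may differ from ordering the sources by number of observed hosts.

```python
def intersection_rate(smaller, larger):
    """Return (matching BadHoods, percentage relative to the smaller list)."""
    if not smaller:
        raise ValueError("the smaller blacklist must not be empty")
    common = smaller & larger
    return len(common), 100.0 * len(common) / len(smaller)


def compare(blacklist_a, blacklist_b):
    """Compare two sets of /24 BadHoods, always relative to the smaller one."""
    if len(blacklist_a) <= len(blacklist_b):
        return intersection_rate(blacklist_a, blacklist_b)
    return intersection_rate(blacklist_b, blacklist_a)


# Hypothetical BadHood blacklists for two applications.
app_x = {"10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"}
app_y = {"10.0.1.0/24", "10.0.9.0/24"}
matches, rate = compare(app_x, app_y)
```

Dividing by the smaller list keeps the percentage interpretable as "how much of this source's BadHoods also appear in the other one".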

Analyzing this table, we can observe that, for only two cases (U-5559 and
T-25, both against CBL), we have an intersection rate above 90% (in relation to
the U-5559 and T-25 data set sizes). That means that more than 90% of the
BadHoods that carry out attacks on port 5559 and on port 25 also carry out spam
attacks. We would expect such a high rate for T-25, since it monitors the
default SMTP port; however, UDP port 5559 is not assigned by IANA, which means
no official application is supposed to run on this port.

However, for the rest of the applications, we can see the matching rate
between any two data sets is below 51%, with the majority below 30%. These
are very low values if one intends to use BadHood blacklists from one appli-

tions from what we have found in Section 7.1.3, in which we have broken down
the DShield data set according to the application, and found that most of the
BadHoods (89.8%) have carried out attacks employing only a single application.
Therefore, what we can conclude is that, for most of the cases, the BadHoods
attacking two different applications differ, and therefore it is necessary to carry



7.2.3 Correlation 

ing BadHood blacklists. To do this, we have chosen a subset of application
blacklists which have presented an intersecting rate of at least 90%, as shown in
Table 17^

Figure 7.2(a) shows the scatter between CBL and U-5559. In this figure,

refers to the number of hosts that the particular BadHood has used to attack
CBL, while y refers to the number of hosts employed to attack U-5559. As can
be seen, the majority of points are below the green line (which would indicate
that they employ a similar number of hosts). That means that most of the
BadHoods observed in the intersection have attacked CBL using more hosts in
comparison to U-5559, which is confirmed by the Δ values (see Section
analysis shown in Figure 7.2(b). In this figure, we can observe the frequency
of the number of hosts of intersecting BadHoods between CBL and U-5559.
Figure 7.2(b) shows that the majority of intersecting BadHoods have attacked
CBL using a larger number of hosts than U-5559 (Δ > 0).
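
The per-BadHood comparison behind these figures can be sketched as below. The exact definition of Δ is not visible in this section, so the sketch assumes Δ = x - y, where x and y are the numbers of hosts a BadHood used against the first and second source; this is consistent with Δ > 0 meaning more hosts against CBL. All names and sample values are hypothetical.

```python
from collections import Counter


def host_deltas(hosts_x, hosts_y):
    """Delta (x - y) in host counts for BadHoods present in both sources."""
    shared = hosts_x.keys() & hosts_y.keys()
    return {badhood: hosts_x[badhood] - hosts_y[badhood] for badhood in shared}


def delta_histogram(deltas):
    """Frequency of each delta value, i.e. the data behind a histogram."""
    return Counter(deltas.values())


# Hypothetical host counts per /24 BadHood for two data sources.
cbl = {"10.0.0.0/24": 5, "10.0.1.0/24": 2, "10.0.2.0/24": 1}
other = {"10.0.0.0/24": 1, "10.0.1.0/24": 2, "10.0.9.0/24": 7}

deltas = host_deltas(cbl, other)
histogram = delta_histogram(deltas)
share_positive = sum(1 for d in deltas.values() if d > 0) / len(deltas)
```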

These results may also be influenced by the number of monitored IP addresses
each data source has, since increasing the number of monitored IPs
may increase the odds of being attacked by more hosts. However, this
information is not provided by those datasets: CBL does not disclose the number
of IP addresses it monitors, and DShield provides aggregated information in
relation to the source IP addresses (attacking IPs).

The same conclusions can be obtained from analyzing the case of CBL and
T-25, as can be seen in Figures |73Ta^|73M and [73^1



7.3 Conclusions 

In this chapter we have compared BadHood blacklists obtained from various
applications. The goal was to determine if different types of Internet attacks

To answer this question, we have obtained representative data sets containing
offending IP addresses from various applications, for the same monitoring
compared them using the methodology described in Chapter 5.

In our analysis, we found that the number of offending BadHoods varies
significantly according to the application (by a factor of up to 550). The
explanation for this variation lies with the specifics of the application being
exploited for

online sales of counterfeit/illegal pharmaceutical products is heavily based on
spam, and it is a market far from being saturated. By analyzing leaked data
from two pharmaceutical operations, they have shown that, on average, 1,500
and 3,500 new clients are attracted by spam campaigns every week, which
makes spam profitable, and therefore, we can expect it to continue. In contrast,
phishing attacks have a different business model that does not rely upon a
massive number of individual IP addresses as spam does. The difference between
the applications' business models is therefore reflected in their respective
BadHoods.

Moreover, our results have shown that for only two cases (out of 49), we
found two BadHood blacklists having an intersecting rate above 90% (w.r.t. the
smallest blacklist). The cases were when we compared CBL (a spam blacklist)
to DShield's TCP port 25 attacks (T-25) and DShield's UDP port 5559 (U-5559)
attacks. For the first case, this could be explained, since both are related to
spam. Since UDP port 5559 is not registered, we cannot tell if there is any rela-

(excluding CBL) was below 141,000, which is equal to 1.4% of the maximum
theoretical /24 BadHoods of the IPv4 addressing space. For these two particular
cases, we also found that CBL has been attacked much more often by a larger

sources were compared.

The implication of our results is that Internet BadHoods should be application
tailored, which supports the BadHood definition we have introduced in
Chapter 1 and the findings in Chapterl?). Therefore, we can conclude that a
network administrator should employ application-specific BadHood blacklists.
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CHAPTER 8 



Bad Neighborhoods Temporal Attack 
Strategies 
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And how long does it take for most of them to strike again? The answer 
can be used to develop models that predict attacks from BadHoods, based
on the historical past.

The rest of this chapter is divided as follows. In Section 8.1, we cover the
data sets used in this chapter. Next, in Section 8.2, we address RQ 8.1, while
in Section 8.3 we address RQ 8.2. After that, we address RQ 8.3 in Section 8.4,
and the conclusions are presented in Section 8.5.

8.1 Evaluated Datasets

• April 2010: from the 19th to the 26th (8 days).

• November 2011: from November 11th to the 17th (7 days).

For the monitoring period, we have collected data from three data sources. We 
and can be found in Sections O and ED ^ 

• CBL Spam blacklist (CBL) (see also Section 7.1.1).

• UT/EWI (UT/EWI): obtained from analyzing the spam filter logs from
Twente,

• DShield data set: DShield is a community shared firewall log system
(see more in Section 7.1.3). We have chosen to focus on two of the most

- TCP 445 (T-445): Microsoft-DS Active Directory and/or Windows
shares.

- TCP 3389 (T-3389): Microsoft Terminal Server (RDP).

a /24 BadHood blacklist. These BadHood blacklists were then employed to
answer the research questions presented in the introduction.
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8.2 Daily Number of Bad Neighborhoods 

In this section we investigate whether the number of Internet Bad
Neighborhoods that a target observes changes for different days (RQ 8.1).

We have several reasons to expect that the BadHoods distribution over
different days is far from being static. The main one is a consequence of the
behavior that individual hosts (/32) exhibit, trying to be as stealthy as possi-

in Appendix. For example, Figure 8.1(a) shows the daily number of unique
spammers (/32 hosts) for UT/EWI throughout November 2011. As can be seen,
the values range from less than 20K to more than 120K individual hosts per day,
over a period of 24 days. Figure 8.1(b) shows the daily variations for the CBL
spam blacklist, which also exhibits a variation for the monitored days (please
notice the difference between the y axis scale of both figures).

time is that DNS blacklists, such as CBL and PSBL, which contain
many malicious /32 IP addresses, have to be constantly updated in order to keep
up with the dynamics of individual hosts and be effective in mail filtering.

Taking these into account, we then proceed to the analysis of the datasets
employed in this chapter. Table 8.1 presents the daily number of BadHoods
for each individual dataset. As we expected, for all data sets, the number of
BadHoods changes on a daily basis. In addition to that, we observe that:

Table 8.1: Number of BadHoods/day



UT/EWI and DShield data sets (T-445 and T-3389) than CBL (Nsi

divided by the day having the least entries, or 100 x Max/Min.

• Abrupt variations can occur, as can be seen between the 1st and 2nd days of
T-3389 (November 2011).

We address these observations in detail in the next two subsections.
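
The variation measure mentioned above (100 x Max/Min over the daily counts) can be illustrated with a short sketch. The function name and the sample counts are assumptions made here; they are not values from Table 8.1.

```python
def daily_variation(daily_counts):
    """Return 100 * max / min for a series of daily BadHood counts."""
    lowest = min(daily_counts)
    highest = max(daily_counts)
    if lowest == 0:
        raise ValueError("variation is undefined when a day has no entries")
    return 100.0 * highest / lowest


# Hypothetical number of BadHoods observed on three consecutive days.
variation = daily_variation([200, 250, 400])
```

A value of 100 means the busiest and quietest days saw the same number of BadHoods; larger values indicate stronger day-to-day variation.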



8.2.1 Variation between the Datasets

UT/EWI and DShield datasets than CBL has to do with the way each original
blacklist is generated. UT/EWI and DShield datasets are generated based only
on attacks observed on a single day, that is, all /32 entries they list correspond
to at least one attack observed on that very day.

(a) CBL - November 2011    (b) UT/EWI - November 2011




8.3.1 The Bad Neighborhood Occurrence Score 

The results presented in the previous section show the number of days a BadHood
is active for the monitoring data sets. However, they do not show which days of
the monitoring period are chosen by the BadHoods. For example, 2 days could
be a combination of any 2 random days within the monitoring period.

having n days of data, we define, for each /24 BadHood, an occurrence
score as follows:

In this equation, i refers to a day on which the particular BadHood is active.
i may vary from 1 (first day of the monitoring period, not necessarily the day of
the month) until n, the last day in the observed data set. The final occurrence
score is the sum of 2^i for each day i on which the BadHood is active. In the
end, the final number

the n days a certain BadHood carried out attacks.
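
The definition described in this passage can be written compactly as follows. This is a reconstruction from the surrounding description, not the original typeset equation; the symbols $B$ (a /24 BadHood), $D_B$ (the set of days on which $B$ is active) and $\mathrm{score}(B)$ are names assumed here.

```latex
\mathrm{score}(B) \;=\; \sum_{i \in D_B} 2^{i},
\qquad D_B \subseteq \{1, \dots, n\}
```

Because every day contributes a distinct power of two, each score corresponds to exactly one set of active days.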

To better illustrate how the occurrence score is calculated and decomposed,
consider the April 2010 data set from UT/EWI. Table shows an excerpt
of the final BadHood score file that was generated after scoring BadHoods for
the monitoring period. For each BadHood, an occurrence score is provided,
calculated as defined above. As shown in this table, a score of 96 can be
decomposed into two terms. The power of each of them (5 and 6) represents the
days the BadHood was active: the 5th and 6th of the monitoring period. These,
in turn, represent April 23rd and 24th.
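
The calculation and decomposition illustrated above can be sketched in a few lines. The function names are assumptions made here; the logic follows the textual definition (sum of 2^i over the active days i, with days numbered from 1).

```python
def occurrence_score(active_days):
    """Sum of 2**i over the distinct days (numbered from 1) a BadHood is active."""
    days = set(active_days)
    if any(day < 1 for day in days):
        raise ValueError("days are numbered starting from 1")
    return sum(2 ** day for day in days)


def decompose_score(score):
    """Recover the active days from a score by reading its binary digits."""
    days = []
    day = 0
    while score:
        if score & 1:
            days.append(day)
        score >>= 1
        day += 1
    return days
```

For example, `decompose_score(96)` yields days 5 and 6, and a BadHood active on days 1 to 4 scores 30, below the score of day 5 alone (32).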




2^(i+1) implies that the BadHood is active on the i-th day plus any previous
day(s) (i' < i), but never on any days > i. For example, a score of 32 means
that a BadHood is active on the 5th day. However, there is no other combination
of days that would yield a score > 32 and < 64 that would not include the 5th
day. For example, if a BadHood is active on days 1-4, its final score is 30,
which is smaller than the occurrence score of a single day alone (5th day = 32).

Occurrence Scores Distribution and CDF

Figures |3| and g;5| show both the distribution and the cumulative distribution
function (CDF) of the occurrence scores (left and right columns, respectively),
for the April 2010 datasets, while Figures |]7j and (0| show the results for the
November 2011 datasets.

Analyzing the figures, we can observe that, with the exception of CBL, no

the exception of CBL, all the other data sets observe small spikes on scores equal
to 2^i, which are BadHoods that have only attacked on a single day. CBL, on the
other hand, presents a significant spike on score 510 (a score that represents
all previous days), as expected from Figure and |a:4Ml which is due the

What we can conclude from our analysis is that, except for CBL, there is
no day or combination of days that is significantly more recurrent than others.
Therefore, our results show that a network administrator should not expect any
pattern or regularity in terms of which days BadHoods choose to attack, which
makes the task of predicting attacks more complex.



8.4 Tracing Back BadHoods: Time Since Last Attack

From the previous results, we observe that there is no particular combination

Therefore, in this section, we focus on a single day of the monitoring period
instead of all the monitored days. We single out the last day and scrutinize each
observed BadHood, in order to determine if they can be traced back to any
previous days. After that, we determine how many days have passed since the

To do that, we have carried out a three-step approach. First, we obtain all
the /24 BadHoods of the last day of each data set (as covered in Section 8.1).
Then, for each of them, we look it up in the final occurrence score file generated
for the whole monitoring period (as shown in Section 8.3.1). Those BadHoods that
have been observed carrying out attacks in the last day in combination with any
of the other previous days (in any combination) are filtered. Mathematically,
this means that we have only considered BadHoods having an occurrence score
larger than the threshold e = 2^n, in which n is the number of monitoring days
for each data set. For the April data sets, e is equal to 256 and 128 for November

In our case, we are interested in the last day i' that a BadHood has
been active (the day before the singled out day). To illustrate this, consider
that a certain BadHood from UT/EWI (April data set) has a score of 262. By
decomposing this number into powers of two, it reveals that this BadHood has
been active in days 8, 2, and 1 (262 = 2^8 + 2^2 + 2^1). From the days it was
active, we compute the difference between the last day (8) and the day right
before it (2),

Table 8.3 shows the number of BadHoods on each data set, and the percent-
1 policy), the majority of

BadHoods that attack a target have also been observed in at least one of the
previous days. For the April 2010 data sets, that means that 65-89% of all
BadHoods observed in the last day are likely to have been observed on all
previous days (7 days), while for the November 2011 datasets, 73-80% of BadHoods
observed on the last day are likely to also have been active on all previous
days (6 days).
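
The trace-back illustrated earlier with a score of 262 can be sketched as follows, assuming occurrence scores built as a sum of 2^i over the active days i. The function names are hypothetical. With n monitoring days, a score larger than 2^n implies activity on the last day plus at least one earlier day.

```python
def is_recurrent(score, n_days):
    """True if the BadHood was active on the last day and on an earlier day."""
    return score > 2 ** n_days


def days_since_previous_attack(score, n_days):
    """Days between the last monitored day and the most recent earlier attack.

    Returns None when the BadHood was not active on the last day or was
    never seen before it.
    """
    last_day_bit = 1 << n_days
    if not score & last_day_bit:
        return None
    earlier = score & (last_day_bit - 1)  # keep only days before the last one
    if earlier == 0:
        return None
    previous_day = earlier.bit_length() - 1  # highest earlier active day
    return n_days - previous_day
```

For the score of 262 over 8 days, the most recent earlier day is day 2, so `days_since_previous_attack(262, 8)` gives 8 - 2 = 6.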

Then, the next step was to determine when each of the recurrent BadHoods
was last observed. Figure shows these results as a cumulative distribution
function (CDF). As can be seen, for all the data sets, the majority of the recurrent

Another interesting fact that can be observed from these results is that, for
all the data sets, at least 85% of the recurrent BadHoods are observed within
the last five days, which is valuable information to determine how many days
should be considered to build BadHood attack prediction models.



(a) DShield T-445    (b) DShield T-445 CDF
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8.5 Conclusions 

research questions and evaluated different real world da
In RQ 8.1, we asked "What is the daily variation in I
BadHoods?". We found that, for all the evaluated data
number of active BadHoods on a daily basis. In addition
the Spam Blacklist CBL shows a much smaller proportio
other data sets, mainly because CBL keeps mali



we also found that
hosts in their blacklists for
This confirms tn



"one-size-fits-all" temporal prediction model. Moreover, the usefulness of the
recent historical past has been proved in RQ 8.2, since up to 95% of BadHoods

9.1 Summary of Contributions

centrated in certain portions of the IP address space. This

concentration of malicious hosts, at various aggregation levels, which is the main
contribution of this dissertation.

By first framing such concentration of malicious hosts as Internet Bad
Neighborhoods, we have put the Bad Neighborhoods (BadHoods) under scrutiny, in a
multifaceted way. Figure 9.1 shows the BadHoods facets that were inves-

(RQ), raised in Section|rg

For RQ 1 ("What are the characteristics of Internet Bad Neighborhoods?"),
we have proposed, in Chapter 1, a definition for what a Bad Neighborhood is.
Next, in Chapter 2, we have presented three assumptions for the occurrence of
BadHoods ("why" in Figure 9.1). Following that, we have investigated in
Chapter 3 how malicious IP addresses can be aggregated into network prefixes
(/24-/8, in CIDR notation). Later, in Chapter 4, we have evaluated our
assumptions for the existence of BadHoods proposed in Chapter 2 and shown
in which Internet Service Providers, countries, and cities BadHoods are located.
Then, in Chapter 5, we have carried out a case study on spamming BadHoods,

from $10 billion to $87 billion yearly.

In RQ 2 ("Which blacklists should a network administrator choose to protect
a network against attacks from Internet Bad Neighborhoods?"), in turn, we have
assumed the point of view of a network administrator who employs BadHood

blacklist sources a network administrator should use: public, peers, or local
measurements. In Chapter 7, we have addressed the question whether a network
administrator can employ a BadHood blacklist obtained for one application
(e.g., mail) to protect against attacks to other applications (e.g., SSH). Finally,



BadHoods to determine how often blacklists should be updated

9.2 Main Findings and Implications 



network prefixes (e.g., /24), but also at different and coarser aggregation levels,
such as Internet Service Providers (ISPs) and countries.

As shown in ChapterQ, the top 20 Autonomous Systems (ASes), which are
somehow comparable to ISPs, concentrate almost 50% of all spamming IP
addresses observed in our data sets, from a total of 42,201 active ASes in our
analysis. In the worst case, a single ISP from India (BSNL, AS number 9829)

in our datasets. Moreover, when considering the ratio of malicious IP addresses
in an ISP (number of spamming addresses divided by the number of announced
addresses), we found that some ISPs have an alarming ratio of up to 62.55%
of their announced IPs sending spam (SpectraNet, AS number 37340, an ISP
from Nigeria). These results confirm the existence of Bad Neighborhoods at the

Also, we found that BadHoods are concentrated in certain countries. For
the case of spam, even though we found spamming hosts all over the world,

found having spamming hosts, a single one (India) was found concentrating
almost 20% of worldwide spamming IP addresses, followed by Vietnam and
Brazil (~7% each). In total, the top 20 countries were responsible for 76.31% of
all the spamming IP addresses. These results also confirm that certain countries
concentrate most of the malicious spamming IP addresses.

These findings advance the state of the art by showing that malicious hosts
are concentrated not only in certain portions of the IP address space,

but more clearly at higher aggregation levels, such as ISPs and
ployed as an auxiliary approach to evaluate traffic from unknown sources. Our
results do not support, however, that a country or AS should be entirely
blacklisted; instead, BadHood blacklists at such coarse levels should be employed

filter, such as SpamAssassin, AS-based and Country-based BadHood blacklists
can be used to complement current filter rules when scoring the likelihood
of a message being spam, or even consider network prefixes smaller than /24,
using the algorithms proposed in Chapter 0. The main advantage of
BadHood-based solutions is that we can avoid looking into the contents of an
e-mail message, which is typically employed in Bayesian spam filtering
techniques. Such techniques are more CPU intensive than simple IP/ASN

Another implication of these findings is that it makes it "easier" to tackle the
problem of malicious IP addresses on the Internet, by "nipping the problem in

support that a "clean up" on networks in ISPs and countries having higher
concentration of malicious IP addresses would be more effective. Such measures
can also be supported through specific legislation, similar to the United States'
CAN-SPAM Act and European Union's Directive on Privacy and Electronic
Communications (2002/58), even though we have shown in Section |0]
that legislation alone may not be sufficient (five out of the top twenty high
volume spamming BadHoods are located in the European Union).



9.2.2 Bad Neighborhoods May Vary According to the Application
Exploited

is more likely to have higher concentration/incidence of various crimes, such as
robbery, car theft, etc. On the Internet, however, that is not the case: BadHoods

In Chapter 0 we have shown that while spam is distributed all over the
world (but concentrated in Southern Asia), phishing Bad Neighborhoods, on the
other hand, are mostly concentrated in the United States and other developed

specifics of spam and phishing. Most spamming hosts are part of an army



of "hijacked" malicious hosts (part of botnets), typically at home, schools and
businesses with no availability guaranteed. Phishing hosts, on the other hand,

the phishing site should be accessible



more likely to be hosted on reliable infra
providers, which, in turn, are mostly lo
the United States.

In addition, in Chapter 7 we found that 11

the set of attacking BadHoods were aknc
Neighborhoods are application-specific.

The implication of these findings is that security systems employing
BadHood-based techniques should employ application-specific BadHood blacklists.
In addition, research work aiming at predicting attack sources, such as the work
by Soldo et al., can be optimized by taking into account the application em


9.2.3 Bad Neighborhoods are Likely to Attack Again 

In Chapter 8 we found that the number of /24 BadHoods (in CIDR notation)
that attack individual targets varies daily. After being attacked by
a particular BadHood, however, a network administrator might wonder if the
same BadHood will return to attack the same target again, and if it will, when
it can be expected. We found in Chapter 8 that 40-95% of all BadHoods are
likely to strike a target more than once, depending on the application/dataset,
within a one-week period. However, there were no particular combinations of days



This finding highlights the benefits provided by employing the Bad
Neighborhood concept. For example, in Chapter|D we found that, in a one-week
period, 46.94% of the individual IP addresses attack only once, part of their
stealth tactic ("flying under the radar"), as discussed in Appendix^,

We also found in Chapter 8 that, by singling out all /24 BadHoods that
attacked a target in a particular day, 66-89% of these /24 BadHoods have also

the target/application in question.
attacks should be employed to predict future attacks.



9.2.4 Public Blacklist Sources Allow Better Detection Results 

administrator can employ blacklists from third parties or carry out local
measurements to generate local blacklists. In Chapter 6, our results have shown
that it is better to employ blacklists obtained from public third-party sources,
instead of carrying out local measurements.

This is due to the fact that specialized public sources typically employ a large
number of distributed monitoring points. Consequently, these sources are able
to capture more malicious hosts, which typically try to employ a stealth "under
the radar" attack strategy, as discussed in Appendix 0. Therefore, a network
administrator can protect a network by "anticipating" BadHoods blacklisted by

public blacklists in order to verify their quality before applying them.



9.2.5 "Silent Ticking Spam Bomb" in BRIC countries 

Another finding in this dissertation is that there might be a "silent ticking spam
bomb" in the BRIC countries (Brazil, Russia, India, and China). Currently, these
countries have a moderate Internet penetration (Brazil - 40.6%, Russia - 43.0%,
India - 7.5%, China - 34.3%, World Average - 35%) that is expected to
grow between 9% and 15% yearly, according to a Boston Consulting Group report,
driven by their economic growth. The growth, per se, is a positive
achievement for the countries and their population, since "exclusion from it (the
Internet) is one of the most damaging forms of exclusion in our economy and
culture", as stated by the sociologist Manuel Castells.

However, a problem might emerge if the ratio of malicious IP addresses in
these countries remains stable while the number of Internet users increases. In

sources. To illustrate this, consider India, a country that ranks first in number of
spamming IP addresses. If India had the same Internet penetration rate
as the United States (a developed country comparable in size) while keeping its
current ratio of malicious IP addresses, that would cause an increase of 200%



9.3 Moving Forward from Findings 



As described in Section 1.2, the goal of this dissertation was to scrutinize the
Bad Neighborhood phenomenon on the Internet to better understand its in-

foundations on which BadHood-based security solutions can be built.

Therefore, the next natural step is to employ the knowledge provided in this
dissertation in security solutions. Traditional /32 IP address blacklists try to
protect a target from attack sources based on their past behavior. BadHood-based
solutions can be seen as a step further in this process: they not only protect
from previous sources but also predict new sources (neighbors) of attacks, as we
have shown in this dissertation.
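
As an illustration only, the difference between a /32 blacklist and a BadHood
blacklist can be sketched in Python. The function names and the sample
addresses (taken from the IPv4 documentation ranges) are assumptions made for
this sketch; it is not the implementation evaluated in this dissertation.

```python
import ipaddress
from collections import Counter


def aggregate_to_badhoods(blacklisted_ips, prefix_len=24):
    """Count blacklisted IPv4 addresses per enclosing netblock (BadHood)."""
    counts = Counter()
    for ip in blacklisted_ips:
        # strict=False masks the host bits, yielding the enclosing netblock
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[net] += 1
    return counts


def is_in_badhood(ip, badhoods, prefix_len=24):
    """Return True if the address falls inside a known BadHood."""
    net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    return net in badhoods


# Hypothetical /32 blacklist
blacklist = ["192.0.2.10", "192.0.2.77", "198.51.100.5"]
badhoods = aggregate_to_badhoods(blacklist)

# 192.0.2.200 was never blacklisted itself, but its neighbors were
print("192.0.2.200" in blacklist)              # False
print(is_in_badhood("192.0.2.200", badhoods))  # True
```

A plain /32 lookup misses the neighbor, whereas the BadHood lookup flags it
because two hosts of the same /24 are already listed.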

In Appendix E, we evaluate a BadHood-based mail filter based on the find-
ings obtained in Chapter 6. We implement an algorithm that uses as parameter

However, the major direction we envision for BadHood-based research is to
develop algorithms that combine not only findings from a single chapter, but
that build upon the findings presented in the entire dissertation. To
mention a few, we have seen in Chapter 5 that employing BadHood blacklists
of third-party sources leads to better detection results, while in Chapter 7 we
have seen that BadHood blacklists should be application-specific. In addition,
we have provided two aggregation algorithms and seen that the
coarser the aggregation criteria, the larger the aggregation errors. Moreover,
we have seen that 40-95% of BadHoods are likely to strike more than

AS-based and country-based BadHood blacklists should be employed
in the process.

The next step is therefore to combine all the findings and evaluate the re-

In addition, as discussed in Section 1.3, this dissertation has covered IPv4-
based BadHoods. With the increasing adoption of IPv6, we can expect more
attacks from IPv6 Bad Neighborhoods (currently IPv6 traffic accounts for less



C, an aggregation-style approach like the Internet Bad Neighborhoods approach
is a necessity (for scalability purposes) when dealing with IPv6 attacks, due to
the way that IPv6 addresses are allocated. However, further investigation will
be necessary to confirm if the findings provided in this dissertation hold for IPv6
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APPENDIX B 



The Rise of Botnets 



are performed on the Internet. While in the past most of the attacks originated
from single compromised servers, a significant part of current attacks
comes from distributed compromised machines, part of the so-called botnets


server farms, open relays, or compromised servers. In fact, back then
spamming was not even a crime; only in 2003 did the CAN-SPAM Act make spam-
ming illegal in the United States. To fight spam, several techniques were
developed, and real-time blacklists containing the IP addresses of
those spam sources proved effective.

Meanwhile, also in the first years of the last decade, broadband technologies
such as ADSL increased home broadband adoption all over the world. As
a consequence, home computers were left online for more time, while having
increased bandwidth in comparison to the old dial-up access.

In this context, spam gangs realized that they could improve their strate-

servers that became increasingly less effective due to the advent of real-time IP

hosts". Third-party hosts, in this case, were computers with broadband
connections at homes, schools, businesses and governments, running vulnera-
ble operating systems. Even though the processing and networking capa-
bilities of each host were not enough to conduct major spam campaigns or Dis-
tributed Denial-of-Service (DDoS) attacks, the combined capability of a large set
of hosts was, which led to the creation of modern large-scale botnets. Current
botnets, such as BredoLab, were estimated to have a spam capacity of 3.6 bil-
lion messages per day, by compromising more than 30 million hosts worldwide.




"Fly Under The Radar" Attack Strategy 

" of compromised hosts Ml dis- 

zombies belonging to the botnet Hlux/Kelihos.B (this botnet was
later found to have more than 100,000 bots). By carrying out such attacks using zom-
bies, attackers can hide their real identity and amplify the power of the attacks,
as described in Section 5.3.

As explained by Bailey et al., one of the main problems for botnet herders
is to spread their worms to other computers, increasing the size of the botnet
army. Many propagation techniques can be employed for this purpose, and
current botnets combine many of them to maximize infection. For example, the
"SDBot exploits Windows vulnerabilities, P2P networks, and backdoors left by

To cope with the fact that bots are distributed all over the world, real-time IP
blacklists were developed. These lists contain IP addresses that originated mali-
cious activity and are constantly updated in order to keep up with the dynamism
of the sources of attacks. Such blacklists are popular and are used to fight spam.

In this cat-and-mouse game, Internet criminals have also developed methods
to try to circumvent blacklist-based solutions. Since blacklist-based detection is
reactive - that is, as soon as an attack is detected, the source is blacklisted and

paigns by "flying under the radar" - that is, by spamming a mail server using
a large number of bots, but sending only one spam per bot.
As shown in Chapter 5, we found from mail server logs that
46.94% of the spammers sent only 1 spam message over a period of one
week. This, in fact, constitutes a major problem for source IP blacklist-based
detection systems, since the server keeps getting spam while blacklisting sources
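
The effect can be illustrated with a small simulation. This is a minimal
sketch under stated assumptions: the campaign is hypothetical (1000 bots that
send one message each, concentrated for illustration in four /24 netblocks),
and the function names are not taken from this dissertation.

```python
import ipaddress


def blocked_messages(sources, key=lambda ip: ip):
    """Count messages blocked by a purely reactive blacklist.

    A source is only listed after its first message has been delivered;
    `key` maps an address to the unit that gets blacklisted.
    """
    listed = set()
    blocked = 0
    for src in sources:
        unit = key(src)
        if unit in listed:
            blocked += 1
        else:
            listed.add(unit)  # the first message gets through, then we list it
    return blocked


def slash24(ip):
    """Return the /24 netblock that contains an IPv4 address."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)


# Hypothetical campaign: 1000 bots, one spam message per bot
campaign = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]

per_host = blocked_messages(campaign)           # reactive /32 blacklist
per_hood = blocked_messages(campaign, slash24)  # reactive /24 BadHood blacklist
print(per_host, per_hood)  # 0 996
```

Under these assumptions the /32 blacklist never blocks anything, because no
source returns, while listing the whole /24 after the first hit blocks all
but the first message from each netblock.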



APPENDIX C 



IPv6 Bad Neighborhoods 



Standardized by the Internet Engineering Task Force (IETF), the Internet Pro-
tocol version 6 (IPv6) is a revision of the IP protocol intended to succeed
the standard version 4 (IPv4). One of the main design requirements for IPv6
was to cope with the well-known shortage of IPv4 addresses for the cur-
rent number of devices connected to the Internet. In fact, the last /8 IPv4
netblocks in the free pool of the Internet Assigned Numbers Authority (IANA)
were allocated on February 3rd, 2011.

Currently, IPv4 still dominates the volume of traffic on the Internet, and
IPv6 represents no more than 1% in backbones such as Internet2 or at
the Amsterdam Internet Exchange point (AMS-IX) (the weekly average incoming
traffic in AMS-IX is 525,845 Gbps, while IPv6 accounts for 2.5 Gbps, or 0.2% of
the total). However, we can expect an increase in the volume of IPv6 traffic
as more IPv6 addresses are assigned. For example, at the University of Twente,
in which we have a fully operational IPv6 network, IPv6 represents 3.2% of the

With the increasing adoption of IPv6, we can expect more attacks from IPv6
sources, such as the first reported IPv6 DDoS attacks in 2012. Therefore, in the

IPv6 BadHoods. 

As shown in Appendix B, to cope with blacklist-based defense software, at-
tackers started to use more and more intermediary hosts to carry out attacks,
so they can relay their attacks through "untainted third-party hosts". In
the IPv4 standard, the IP source address field has a length of 32 bits, which
means that, theoretically, attackers could use 2^32 different IP addresses for their
attack, or approximately 4.3 x 10^9 addresses. In IPv6, the source address field
was extended to 128 bits, which means that, theoretically, 2^128 IPv6 addresses
are available, or approximately 3.4 x 10^38 addresses (or 6.67 x 10^23 IPv6
addresses per square meter on Earth). This massive number of IPv6 addresses
presents a major challenge for blacklist-based IPv6 security mechanisms.
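
These orders of magnitude can be checked with a few lines of Python. The
value used for the surface area of the Earth (about 5.1 x 10^14 square
meters) is an assumption of this sketch and is not stated in the text above.

```python
# Size of the IPv4 and IPv6 address spaces
ipv4_addresses = 2 ** 32
ipv6_addresses = 2 ** 128

# Assumed surface area of the Earth, in square meters (approximate)
EARTH_SURFACE_M2 = 5.1e14

ipv6_per_square_meter = ipv6_addresses / EARTH_SURFACE_M2

print(f"{ipv4_addresses:.2e}")         # 4.29e+09
print(f"{ipv6_addresses:.2e}")         # 3.40e+38
print(f"{ipv6_per_square_meter:.2e}")  # 6.67e+23
```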



C.1 IPv6 Addressing Architecture

The IPv6 addressing architecture is defined in RFC 4291. As covered
in the RFC, there are three types of IPv6 addresses:

• Unicast: An identifier for a single network interface (equivalent to IPv4
unicast).

• Anycast: An identifier for a set of network interfaces. A packet sent to an
anycast address is delivered to the "nearest" interface of the set, as defined
by the "routing protocols' measure of distance".

• Multicast: An identifier for a set of network interfaces. A packet sent to a
multicast address is delivered to all interfaces of the set.

In this section, we focus on unicast addresses (a similar analysis can be
conducted for anycast and multicast addresses). Figure C.1 shows the format of
an IPv6 global unicast address (that is, "routable" on the Internet). In this figure,

routing prefix and subnet ID are used for routing, while interface ID is used to
identify the host interface.

Taking this into account, it is likely that providers will assign end sites (e.g.,
home users) a /48 IPv6 prefix (48 bits for the global routing prefix, leaving
16 for the subnet and 64 for the interface), as described in RFC 6177.

That implies that home users will have at their disposal at least 2^48
(~ 2.81 x 10^14) IPv6 addresses to choose from (the 80 bits left by a /48 prefix
allow for up to 2^80), which is already larger than the total IPv4 capacity (2^32).
As a consequence, a single malicious host can actually carry out attacks (e.g.,

since he/she has at least 2^48 addresses to choose from.



has received an order to spam a single particular mail server. If this spammer
is able to send 1 million messages per second to the same mail server (which
is already an absurd number), it will take almost nine years to exhaust just
2^48 of the addresses available within the /48 netblock. And that is only one
spammer. In this context, standard IPv6 /128 blacklists cannot cope with the
massive number of addresses available for attackers to use.
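
The estimate above can be reproduced with the following sketch. It assumes a
constant rate of one million messages per second, a fresh source address for
every message, and a year of 365.25 days; the function name is illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_exhaust(address_bits, rate_per_second=1_000_000):
    """Years needed to use each of 2**address_bits addresses exactly once."""
    return 2 ** address_bits / rate_per_second / SECONDS_PER_YEAR


print(f"{years_to_exhaust(48):.2f}")  # 8.92
print(f"{years_to_exhaust(80):.2e}")  # 3.83e+10
```

The first figure corresponds to the "almost nine years" quoted for 2^48
addresses. The second uses all 80 bits that a /48 prefix leaves free, which
only strengthens the argument against per-address (/128) blacklisting.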

In this sense, the Bad Neighborhood approach to malicious networks is fun-
damental to cope with the vast number of valid IPv6 addresses available. We
propose the following for IPv6 blacklists:

1. Employ /48 as the smallest BadHood aggregation netblock for IP-addressing
based BadHoods;

fiie/iire necessiiry to cope with it (e.g., 

We believe that some collateral damage may occur. For example, if a single
/64 gets blacklisted, a legitimate host within the /64 may also have to pay the
price. Other approaches might be necessary to avoid this, but blacklisting
individual /128 hosts will not be efficient.
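
A minimal sketch of such an aggregation, and of the collateral damage it can
cause, is given below. The addresses come from the IPv6 documentation prefix
(2001:db8::/32) and the function name is hypothetical; this is an illustration
of the proposal, not an evaluated implementation.

```python
import ipaddress
from collections import Counter


def ipv6_badhoods(addresses, prefix_len=48):
    """Aggregate malicious IPv6 addresses into netblocks of the given size."""
    counts = Counter()
    for addr in addresses:
        # strict=False masks the host bits, yielding the enclosing netblock
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        counts[net] += 1
    return counts


# Hypothetical malicious sources
observed = ["2001:db8:1:1::1", "2001:db8:1:2::beef", "2001:db8:2::1"]

by_48 = ipv6_badhoods(observed, 48)  # 2 netblocks
by_64 = ipv6_badhoods(observed, 64)  # 3 netblocks

# A legitimate host that happens to share a blacklisted /64
legit = ipaddress.ip_address("2001:db8:1:1::abcd")
collateral = any(legit in net for net in by_64)
print(len(by_48), len(by_64), collateral)  # 2 3 True
```

The coarser /48 aggregation yields fewer entries, but every additional
address covered by a listed netblock is a potential false positive.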



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Country Codes Employed in Chapter 4 



KAZAKHSTAN



NETHERLANDS 



UNITED STATES



VIRGIN ISLANDS 



Table D.1: Country Codes



APPENDIX E 

Third-Party Bad Neighborhood Blacklists
for Spam Detection



This appendix contains an excerpt of the following paper, accepted for publi-

dscaseu presented in Chapter |S) 

• Moura, G. C. M., Sperotto, A., Sadre, R., Pras, A.: Evaluating Third-Party
Bad Neighborhood Blacklists for Spam Detection. In: IFIP/IEEE Interna-
tional Symposium on Integrated Network Management (IM 2013). Ghent,
Belgium, 27-31 May 2013 (to appear)
E.1 Effectiveness on Detecting Spam

In an earlier chapter, we have focused on how specific a BadHood blacklist is
to its measurement points. The presented analysis allowed us to verify if
BadHood blacklists

blacklists by measuring the ratio of detected Spam messages. In addition, we

We begin with a description of the methodology and the considered
scenario in Section E.1.1, followed by a discussion of the achieved results in
Section E.1.2.



E.1.1 Methodology and Considered Scenario

analyzing the origin of e-mail messages as well as the links within the messages
to malicious websites. One of the criteria used in their approach is whether the

threshold. Similar to our work, the authors used publicly available blacklists to
build the list of BadHoods.

With this Spam detection scenario in mind, we investigate here the effective-
ness of employing different BadHood blacklists to detect Spam messages. For

the threshold-based criterion described above.

Consider L_S as the BadHood blacklist to be used for Spam detection. When-
ever a new message M arrives, the mail filter extracts the source /24 netblock
address of the sender (M/24) and checks it against the list L_S. If M/24 is found
in L_S, then the mail filter will declare the message as Spam if the number of
malicious hosts listed for M/24 is greater than θ, where θ (0 < θ < 255) can be
considered a threshold on how malicious a BadHood is. This procedure is
summarized in Algorithm 3.
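
The procedure can be sketched in Python as follows. This is a minimal
illustration under stated assumptions: the blacklist is represented as a
dictionary that maps each /24 netblock to the number of malicious hosts listed
in it, and the names `is_spam` and `L_S` as well as the sample entries are
hypothetical.

```python
import ipaddress


def is_spam(sender_ip, blacklist, theta):
    """Threshold-based BadHood check for the sender of a message.

    `blacklist` maps a /24 netblock to the number of malicious hosts listed
    in it; the message is flagged only if that number exceeds `theta`.
    """
    m24 = ipaddress.ip_network(f"{sender_ip}/24", strict=False)
    return m24 in blacklist and blacklist[m24] > theta


# Hypothetical BadHood blacklist
L_S = {
    ipaddress.ip_network("192.0.2.0/24"): 40,
    ipaddress.ip_network("198.51.100.0/24"): 3,
}

print(is_spam("192.0.2.15", L_S, theta=10))    # True
print(is_spam("198.51.100.9", L_S, theta=10))  # False
print(is_spam("203.0.113.1", L_S, theta=0))    # False
```

Raising `theta` makes the filter more conservative: lightly infected
neighborhoods are no longer blocked, at the cost of letting more Spam through.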

To evaluate the effectiveness of the different BadHood blacklists, we follow
the same scenario as described previously. We regard the mail servers of
Provider A, UT/EWI, and CAIS/RNP as targets to be protected from Spam. We
apply Algorithm 3 to each target T for different values of θ and for the different
Require: θ
Require: M/24, L_S
Ensure: true, if spam detected; false, otherwise
if M/24 is found in L_S and the number of malicious hosts listed for M/24 > θ then
    return true
else
    return false
end if




blacklists Ls and calcul 



number ol' Span: 

nly use a blacklist to proKc 
blacklist, we will not ; 



E.1.2 Experimental Results 

filtering the Spam directed to Provider A, UT/EWI, and CAIS/RNP, respectively,
as a function of the threshold θ, using the different blacklists. The figures indi-

BadHood blacklists. This is especially true for large blacklists, like CBL, which
always provides the highest hitcount. The figures also show that the hitcount
decreases fast with increasing values of θ, a fact that is most likely due to the
presence of high-volume spammers in the data sets.

A second insight provided by these results is that the value of θ should be
adjusted to the considered BadHood blacklist. For the same θ, the hitcount val-
ues change considerably among BadHood blacklists. At first sight, this seems to
suggest that the best choice for an administrator is the largest BadHood blacklist.
However, large BadHood blacklists might suffer from drawbacks like a high num-

We therefore investigate if smaller BadHood blacklists can still potentially
provide similar hitcounts for appropriately chosen values of the threshold θ. Let Su
In Figures E.1(d), E.1(e) and E.1(f) we present the hitcounts obtained for the
different targets using the rescaled θ defined in Eq. E.2. For the L_CBL list, the
threshold as indicated on the x-axis is used. For the other lists, we compute the
rescaled θ according to Eq. E.2, choosing L_CBL as the reference list.

The figures show that, once the size factor is removed by using Eq. E.2,

formance. These results therefore indicate that this rescaling offers an operational
way for choosing values of θ for different blacklists such that the blacklists are
similarly effective in identifying Spam. In fact, one may be tempted to conclude
from these results that all blacklists perform similarly, independently of their
size.

However, a different picture is obtained when calculating the amount of
legitimate mail traffic erroneously flagged as Spam, i.e., the number of false
positives. The corresponding figure shows the percentage of legitimate mail
messages received by the mail server of UT/EWI that are labeled as Spam for
varying values of the scaled threshold θ. While for CBL and PSBL the percentage
of blocked Ham is less than 10% and rapidly falls to zero, for UT/EWI and
Provider A we observe that up to 60% of legitimate mail would be labeled as Spam
if a very low value of θ is chosen. On the other hand, also in the case of
Provider A and UT/EWI, the percentage of blocked Ham decreases rapidly for
increasing values of θ.
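
The two quantities compared here, the Spam hitcount and the percentage of
blocked Ham, can be computed as in the sketch below. The blacklist contents,
the sender addresses and the function names are invented for illustration and
do not correspond to the data sets used in this appendix.

```python
import ipaddress


def flagged(sender_ip, blacklist, theta):
    """True if the sender's /24 is listed with more than `theta` bad hosts."""
    m24 = ipaddress.ip_network(f"{sender_ip}/24", strict=False)
    return blacklist.get(m24, 0) > theta


def evaluate(blacklist, spam_senders, ham_senders, theta):
    """Return (Spam hitcount, percentage of legitimate mail blocked)."""
    hitcount = sum(flagged(ip, blacklist, theta) for ip in spam_senders)
    blocked_ham = sum(flagged(ip, blacklist, theta) for ip in ham_senders)
    ham_pct = 100.0 * blocked_ham / len(ham_senders) if ham_senders else 0.0
    return hitcount, ham_pct


# Hypothetical data
blacklist = {
    ipaddress.ip_network("192.0.2.0/24"): 40,
    ipaddress.ip_network("198.51.100.0/24"): 3,
}
spam = ["192.0.2.5", "192.0.2.9", "198.51.100.7"]
ham = ["198.51.100.20", "203.0.113.1"]

print(evaluate(blacklist, spam, ham, theta=0))   # (3, 50.0)
print(evaluate(blacklist, spam, ham, theta=10))  # (2, 0.0)
```

In this toy example, raising the threshold removes the false positive but
also lowers the hitcount, which is the kind of trade-off discussed in this
section.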

Our results therefore highlight a trade-off between (i) the size of the black-
list, (ii) the Spam hitcount and (iii) the percentage of blocked Ham. Very large
lists, such as CBL and PSBL, achieve a high Spam hitcount with a low percentage
of blocked Ham but contain a large number of irrelevant entries.
small and mid-sized lists, that is, Provider A and UT/EWI, con

larger lists. However, for θ_CBL < 100, a relatively high number of
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